Marketing Campaign Analysis¶

Problem Definition¶

The Context:¶

  • Why is this problem important to solve?

The Objective:¶

  • What is the intended goal?

The Key Questions:¶

  • What are the key questions that need to be answered?

The Problem Formulation:¶

  • What is it that we are trying to solve using data science?

Data Dictionary¶


The dataset contains the following features:

  1. ID: Unique ID of each customer
  2. Year_Birth: Customer’s year of birth
  3. Education: Customer's level of education
  4. Marital_Status: Customer's marital status
  5. Kidhome: Number of small children in customer's household
  6. Teenhome: Number of teenagers in customer's household
  7. Income: Customer's yearly household income in USD
  8. Recency: Number of days since the last purchase
  9. Dt_Customer: Date of customer's enrollment with the company
  10. MntFishProducts: The amount spent on fish products in the last 2 years
  11. MntMeatProducts: The amount spent on meat products in the last 2 years
  12. MntFruits: The amount spent on fruit products in the last 2 years
  13. MntSweetProducts: The amount spent on sweet products in the last 2 years
  14. MntWines: The amount spent on wine products in the last 2 years
  15. MntGoldProds: The amount spent on gold products in the last 2 years
  16. NumDealsPurchases: Number of purchases made with discount
  17. NumCatalogPurchases: Number of purchases made using a catalog (buying goods to be shipped through the mail)
  18. NumStorePurchases: Number of purchases made directly in stores
  19. NumWebPurchases: Number of purchases made through the company's website
  20. NumWebVisitsMonth: Number of visits to the company's website in the last month
  21. AcceptedCmp1: 1 if customer accepted the offer in the first campaign, 0 otherwise
  22. AcceptedCmp2: 1 if customer accepted the offer in the second campaign, 0 otherwise
  23. AcceptedCmp3: 1 if customer accepted the offer in the third campaign, 0 otherwise
  24. AcceptedCmp4: 1 if customer accepted the offer in the fourth campaign, 0 otherwise
  25. AcceptedCmp5: 1 if customer accepted the offer in the fifth campaign, 0 otherwise
  26. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  27. Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Note: You can assume that the data is collected in the year 2016.

Important Notes¶

  • This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for the Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook will give you a direction on what steps need to be taken in order to get a viable solution to the problem. Please note that this is just one way of doing this. There can be other 'creative' ways to solve the problem and we urge you to feel free and explore them as an 'optional' exercise.

  • In the notebook, there are markdown cells called - Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.

  • The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.

  • All the outputs in the notebook are just for reference and can be different if you follow a different approach.

  • There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they wish to explore different techniques.

Loading Libraries¶

In [2540]:
!pip install yellowbrick
Requirement already satisfied: yellowbrick in /opt/anaconda3/lib/python3.9/site-packages (1.5)
In [2541]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

# To compute distances
from scipy.spatial.distance import cdist

# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer

# Importing PCA
from sklearn.decomposition import PCA

# To encode the variable
from sklearn.preprocessing import LabelEncoder

# Importing TSNE
from sklearn.manifold import TSNE

# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# To compute distances
from scipy.spatial.distance import pdist

# To import K-Medoids
from sklearn_extra.cluster import KMedoids

# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# Importing DBSCAN
from sklearn.cluster import DBSCAN

Let us load the data¶

In [2542]:
# loading the dataset
data = pd.read_csv("marketing_campaign.csv")

Check the shape of the data¶

In [2543]:
# Print the shape of the data

data.shape
Out[2543]:
(2240, 27)

Observations and Insights:¶

There are 2,240 observations (rows) in the data and 27 columns.
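The one-row-per-customer claim can be checked by comparing the ID count with the row count; a minimal sketch on a toy frame (the real check would run on the `data` frame loaded above):

```python
import pandas as pd

# Toy stand-in for the marketing data (illustrative values only)
df = pd.DataFrame({"ID": [5524, 2174, 4141], "Income": [58138.0, 46344.0, 71613.0]})

# The data has one row per customer when the number of unique IDs equals the row count
n_rows, n_unique_ids = df.shape[0], df["ID"].nunique()
print(n_rows == n_unique_ids)  # True when every ID is unique
```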

Understand the data by observing a few rows¶

In [2544]:
# View first 5 rows
data.head()
Out[2544]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 10 4 7 0 0 0 0 0 0 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 1 2 5 0 0 0 0 0 0 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 2 10 4 0 0 0 0 0 0 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 0 4 6 0 0 0 0 0 0 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 3 6 5 0 0 0 0 0 0 0

5 rows × 27 columns

In [2545]:
# View last 5 rows Hint: Use tail() method
data.tail()
Out[2545]:
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
2235 10870 1967 Graduation Married 61223.0 0 1 13-06-2013 46 709 ... 3 4 5 0 0 0 0 0 0 0
2236 4001 1946 PhD Together 64014.0 2 1 10-06-2014 56 406 ... 2 5 7 0 0 0 1 0 0 0
2237 7270 1981 Graduation Divorced 56981.0 0 0 25-01-2014 91 908 ... 3 13 6 0 1 0 0 0 0 0
2238 8235 1956 Master Together 69245.0 0 1 24-01-2014 8 428 ... 5 10 3 0 0 0 0 0 0 0
2239 9405 1954 PhD Married 52869.0 1 1 15-10-2012 40 84 ... 1 4 7 0 0 0 0 0 0 1

5 rows × 27 columns

Observations and Insights:¶

Looking at the first 5 rows (the head) and the last 5 rows (the tail) gives a quick feel for the data. We can see integers, objects, and floats in the data set. There are many boolean flags that record a customer's interactions: responses to the various campaigns, purchase history, and complaint record. This categorical data will be useful.

Customers have a unique ID number.

We can see their education level, marital status, income, and family size, which will help us separate them into different groups.

We can see how long it has been since they last shopped and how much they have spent on select product groupings over the last 2 years.

There are 3 channels through which customers make their purchases: the website, the catalog, or in store.

There have been 5 advertising campaigns, but in these 10 rows there does not appear to be much customer interaction with them. There is room for improvement in this area. We can compare campaign interaction with use of the discounts that have been offered to see the relationship.

Let us check the data types and missing values of each column¶

In [2546]:
# Check the datatypes of each column. Hint: Use info() method
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Response             2240 non-null   int64  
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
In [2547]:
# Find the number of missing values in each column of the data

data.isnull().sum()
Out[2547]:
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Response                0
dtype: int64

Observations and Insights:¶

ID has no null values, and the number of unique values equals the number of observations. So ID looks like an index for the data entries, and such a column would not provide any predictive power for our analysis. Hence, it can be dropped.

Income has 24 missing values.
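The cell's comment asks for percentages, while `isnull().sum()` returns raw counts; `isnull().mean()` gives the fraction directly. A minimal sketch on a toy column mirroring Income's 24 missing values out of 2,240:

```python
import pandas as pd
import numpy as np

# Toy column with 24 missing values out of 2240, mirroring the Income column
income = pd.Series([50000.0] * 2216 + [np.nan] * 24)

# isnull().mean() is the fraction of missing entries; multiply by 100 for a percentage
pct_missing = income.isnull().mean() * 100
print(round(pct_missing, 2))  # 1.07
```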

Dropping the ID column

In [2548]:
# Remove ID column from data. Hint: Use inplace = True
data.drop(columns = ['ID'], inplace=True)

Exploratory Data Analysis¶

Let us now explore the summary statistics of numerical variables¶

In [2549]:
# Explore basic summary statistics of numeric variables. Hint: Use describe() method.
data.describe().T
Out[2549]:
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.012946 0.113069 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0

Observations and Insights:¶

There is a significant gap between the third quartile and the maximum for Income, the number of catalog purchases, and the spend columns, particularly the fruit, meat, sweet, and gold products. This tells us that we may have outliers on the right.

Birth year has a very wide range, 1893-1996, spanning more than a century. The average birth year is 1968.

The spend categories have a very broad range as well. The 25th-percentile spend over two years for both fruits and sweets is only $1.

Let us also explore the summary statistics of all categorical variables and the number of unique observations in each category¶

In [2550]:
# List of the categorical columns in the data
cols = ["Education", "Marital_Status", "Kidhome", "Teenhome", "Complain"]

Number of unique observations in each category

In [2551]:
for column in cols:
    print("Unique values in", column, "are :")
    print(data[column].value_counts())
    print("*" * 50)
Unique values in Education are :
Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: Education, dtype: int64
**************************************************
Unique values in Marital_Status are :
Married     864
Together    580
Single      480
Divorced    232
Widow        77
Alone         3
Absurd        2
YOLO          2
Name: Marital_Status, dtype: int64
**************************************************
Unique values in Kidhome are :
0    1293
1     899
2      48
Name: Kidhome, dtype: int64
**************************************************
Unique values in Teenhome are :
0    1158
1    1030
2      52
Name: Teenhome, dtype: int64
**************************************************
Unique values in Complain are :
0    2219
1      21
Name: Complain, dtype: int64
**************************************************

Observations and Insights:¶

There are 5 different categories under education

There are 8 different categories under marital status with Alone, Absurd and YOLO having less than 5 observations. The name of these categories and their low count should be a red flag as to their importance.

There are 3 categories in how many kids are in the home. They are 0, 1, and 2.

There are 3 categories in how many teenagers are in the home. They are 0, 1, and 2.

There are 2 categories for complaints, 1 for yes and 0 for no in the last two years.
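These shares can also be computed directly with `value_counts(normalize=True)` instead of reading them off the raw counts; a sketch on a toy Complain column with the same 2219/21 split:

```python
import pandas as pd

# Toy Complain column: 2219 non-complainers, 21 complainers
complain = pd.Series([0] * 2219 + [1] * 21, name="Complain")

# normalize=True returns fractions; multiply by 100 for percentages
shares = complain.value_counts(normalize=True) * 100
print(shares.round(1).to_dict())  # {0: 99.1, 1: 0.9}
```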

Think About It:

  • We could observe from the summary statistics of categorical variables that the Education variable has 5 categories. Are all categories different from each other or can we combine some categories? Is 2n Cycle different from Master?
  • Similarly, there are 8 categories in Marital_Status with some categories having very low count of less than 5. Can we combine these categories with other categories?

Let us replace the "2n Cycle" category with "Master" in Education and "Alone", "Absurd", and "YOLO" with "Single" in Marital_Status¶

In [2552]:
# Replace the category "2n Cycle" with the category "Master"

data["Education"].replace("2n Cycle", "Master", inplace= True) # Hint: Use the replace() method and inplace=True
In [2553]:
# Replace the categories "Alone", "Absurd", "YOLO" with the category "Single"

data["Marital_Status"].replace(["Alone", "Absurd", "YOLO"], "Single", inplace=True) # Hint: Use the replace() method and inplace=True
In [2554]:
# Checking to make sure that "Alone", "Absurd", and "YOLO" are no longer listed. 
data['Marital_Status'].unique().tolist()
Out[2554]:
['Single', 'Together', 'Married', 'Divorced', 'Widow']
In [2555]:
# Checking the count to make sure those observations were recategorized, not dropped.
data['Marital_Status'].describe()
Out[2555]:
count        2240
unique          5
top       Married
freq          864
Name: Marital_Status, dtype: object

Univariate Analysis¶

Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.

1. Univariate Analysis - Numerical Data¶

Histograms help to visualize and describe numerical data. We can also use other plots like box plot to analyze the numerical columns.

Let us plot histogram for the feature 'Income' to understand the distribution and outliers, if any.¶

In [2556]:
# Create histogram for the Income feature

plt.figure(figsize=(15, 7))
sns.histplot(x='Income', data=data)
plt.show()

We can observe some extreme values on the right side of the distribution of the 'Income' feature. Let's use a box plot, as it is better suited to identifying extreme values in the data.

In [2557]:
# Plot the boxplot
sns.boxplot(data=data, x='Income', showmeans=True, color="violet")
Out[2557]:
<AxesSubplot:xlabel='Income'>

Observations and Insights:¶

An outlier is a sample or event that is very inconsistent with the rest of the data set: the observation lies far from the other observations. We can see some outliers here that should be taken care of.

Think About It

  • The histogram and the box plot show some extreme values on the right side of the distribution of the 'Income' feature. Can we consider them outliers and remove them, or should we analyze these extreme values first?
In [2558]:
# Calculating the upper whisker for the Income variable

Q1 = data['Income'].quantile(q=0.25)               # Finding the first quartile of Income

Q3 = data['Income'].quantile(q=0.75)               # Finding the third quartile of Income

IQR = Q3 - Q1                                      # Finding the Inter Quartile Range

upper_whisker = Q3 + 1.5 * IQR                     # Calculating the Upper Whisker for the Income variable

print(upper_whisker)                               # Printing Upper Whisker
118350.5
In [2559]:
# Let's check the observations with extreme value for the Income variable
data[data.Income > upper_whisker]
Out[2559]:
Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines MntFruits ... NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response
164 1973 PhD Married 157243.0 0 1 01-03-2014 98 20 2 ... 22 0 0 0 0 0 0 0 0 0
617 1976 PhD Together 162397.0 1 1 03-06-2013 31 85 1 ... 0 1 1 0 0 0 0 0 0 0
655 1975 Graduation Divorced 153924.0 0 0 07-02-2014 81 1 1 ... 0 0 0 0 0 0 0 0 0 0
687 1982 PhD Married 160803.0 0 0 04-08-2012 21 55 16 ... 28 1 0 0 0 0 0 0 0 0
1300 1971 Master Together 157733.0 1 0 04-06-2013 37 39 1 ... 0 1 1 0 0 0 0 0 0 0
1653 1977 Graduation Together 157146.0 0 0 29-04-2013 13 1 0 ... 28 0 1 0 0 0 0 0 0 0
2132 1949 PhD Married 156924.0 0 0 29-08-2013 85 2 1 ... 0 0 0 0 0 0 0 0 0 0
2233 1977 Graduation Together 666666.0 1 0 02-06-2013 23 9 14 ... 1 3 6 0 0 0 0 0 0 0

8 rows × 26 columns

Think About It:

  • We observed that there are only a few rows with extreme values for the Income variable. Is that enough information to treat (or not to treat) them? Do we know at what percentile the upper whisker lies?
In [2560]:
# Check the 99.5th percentile value for the Income variable
data.quantile(q=.995)['Income']
Out[2560]:
102145.75000000003

Observations and Insights:¶

A value that falls roughly 3 standard deviations beyond the mean is still part of the distribution, but it is an unlikely or rare event. For Income, anything above 102,145.75 lies beyond the 99.5th percentile and can be considered an outlier.

In [2561]:
# Dropping observations identified as outliers 
data.drop(data[data['Income'] >= 153924.0].index, inplace = True) # 153924.0 is the smallest of the extreme Income values found above
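An equivalent, arguably more idiomatic way to drop those rows is boolean-mask filtering rather than index-based `drop`; a minimal sketch on a toy Income column, reusing the same threshold:

```python
import pandas as pd

# Toy Income column: two ordinary values, two extreme ones
df = pd.DataFrame({"Income": [48000.0, 51000.0, 157243.0, 666666.0]})
threshold = 153924.0  # smallest of the extreme values identified above

# Keep only rows below the threshold; reset_index keeps the index contiguous
df = df[df["Income"] < threshold].reset_index(drop=True)
print(len(df))  # 2
```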

Now, let's check the distribution of the Income variable after dropping outliers.

In [2562]:
# Plot the histogram for 'Income'
plt.figure(figsize=(15, 7))
sns.histplot(x='Income', data=data)

plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Count')
Out[2562]:
Text(0, 0.5, 'Count')
In [2563]:
# Plot the histogram for 'MntWines'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntWines', data=data)


plt.title('Distribution of Mnt Wines Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')

plt.show()
In [2564]:
# Plot the histogram for 'MntFruits'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntFruits', data=data)

plt.title('Distribution of Mnt Fruits Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')

plt.show()
In [2565]:
# Plot the histogram for 'MntMeatProducts' 
plt.figure(figsize=(15, 7))
sns.histplot(x='MntMeatProducts', data=data)

plt.title('Distribution of Mnt Meat Products Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')

plt.show()
In [2566]:
# Plot the histogram for 'MntFishProducts'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntFishProducts', data=data)

plt.title('Distribution of Mnt Fish Products Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')

plt.show()
In [2567]:
# Plot the histogram for 'MntSweetProducts'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntSweetProducts', data=data)

plt.title('Distribution of Mnt Sweet Products Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')

plt.show()
In [2568]:
# Plot the histogram for 'MntGoldProds'
plt.figure(figsize=(15, 7))
sns.histplot(x='MntGoldProds', data=data)

plt.title('Distribution of Mnt Gold Products Purchases')
plt.xlabel('Dollar amount spent')
plt.ylabel('Count')

plt.show()

Note: Try plotting histograms for different numerical features to understand what the data looks like.¶

Observations and Insights for all the plots:¶

Income: This now looks more like a normal distribution, with a bell-curve shape, since we removed the outliers.

MntWines: Most of the purchases are under $50. Our data is right skewed. This is one of the items with the most amount spent.

MntFruits: Most of the purchases are under $10. Our data is right skewed. This is the item with the lowest total amount spent.

MntMeat: Most of the purchases are under $75. Our data is right skewed with some outliers. This is one of the items with the most amount spent.

MntFish: Most of the purchases are under $15. Our data is right skewed.

MntSweets: Most of the purchases are under $10. Our data is right skewed with some outliers.

MntGold: Most of the purchases are under $25. Our data is right skewed with some outliers.
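The right skew noted for these spend columns can be quantified with pandas' `skew()`; a minimal sketch on synthetic data (the log-normal draw is just a stand-in for a long-tailed spend column):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# Log-normal draws mimic the long right tail seen in the Mnt* spend columns
spend = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# A positive skew coefficient confirms the distribution is right-skewed
print(spend.skew() > 0)  # True
```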

2. Univariate analysis - Categorical Data¶

Let us write a function that will help us create bar plots that indicate the percentage for each category. This function takes the categorical column as the input and returns the bar plot for the variable.

In [2569]:
def perc_on_bar(z):
    '''
    Bar plot for a categorical feature with the percentage of each category annotated.
    z: name of the categorical column
    Note: the function won't work if a column is passed in the hue parameter.
    '''

    total = len(data[z])                                          # Number of observations in the column
    plt.figure(figsize=(15,5))
    ax = sns.countplot(data[z], palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # Percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05                  # x-coordinate of the annotation
        y = p.get_y() + p.get_height()                            # y-coordinate: top of the bar
        
        ax.annotate(percentage, (x, y), size = 12)                # Annotate the percentage 
    
    plt.show()                                                    # Show the plot

Let us plot barplot for the variable Marital_Status.¶

In [2570]:
# Bar plot for 'Marital_Status'
perc_on_bar('Marital_Status')
In [2571]:
perc_on_bar('Education')
In [2572]:
perc_on_bar('Kidhome')
In [2573]:
perc_on_bar('Teenhome')
In [2574]:
perc_on_bar('Complain')

Note: Explore for other categorical variables like Education, Kidhome, Teenhome, Complain.¶

Observations and Insights from all plots:¶

Marital status:

38.6% of the individuals are married, 25.9% are together, 21.7% are single, 10.4% are divorced, and 3.4% are widowed.

Education:

50.3% have a Graduation-level education, 25.6% have a Master's, 21.7% have a PhD, and 2.4% fall under the Basic category.

Kidhome:

57.7% of the individuals have 0 kids in their house, 40.1% have 1 kid, and 2.1% have 2 kids.

Teenhome:

51.7% of the individuals have 0 teens in their house, 46.0% have 1 teen, and 2.3% have 2 teens.

Complain:

99.1% have not made a complaint in the last two years, and 0.9% have.

Bivariate Analysis¶

We have analyzed different categorical and numerical variables. Now, let's check how different variables are related to each other.

Correlation Heat map¶

Heat map can show a 2D correlation matrix between numerical features.

In [2575]:
#heatmap for all the features. 

plt.figure(figsize=(15, 7))                                                        # Setting the plot size
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # Plotting the correlation plot
plt.show()

Observations

Kidhome has a strong negative correlation with Income, all the food categories, and all the purchase channels. Teenhome also shows negative correlations with these variables, but not as strongly as Kidhome.

In [2576]:
#Heatmap for Numerical features. 

num_var = ['Income', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 
           'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
           'AcceptedCmp4', 'AcceptedCmp5']

corr = data[num_var].corr()

plt.figure(figsize=(20, 10))                                                        # Setting the plot size
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")  # Plotting the correlation plot
plt.show()

Observations and Insights:¶

Income is positively correlated with MntWines (0.73), MntMeatProducts (0.70), NumCatalogPurchases (0.71), and NumStorePurchases (0.69). Income has a strong negative correlation with NumWebVisitsMonth (-0.65), indicating that higher-income people spend less time on the website and buy more through the catalog or in stores. Income is positively correlated with all the food categories, and among the campaigns it correlates most with campaign 5.

MntWines is positively correlated with MntMeatProducts (0.59), NumWebPurchases (0.54), NumCatalogPurchases (0.67), and NumStorePurchases (0.64).

MntMeatProducts is positively correlated with all of the other food categories and correlates highly with NumCatalogPurchases (0.70).

Of all the spend categories, MntGoldProds has the lowest correlation with everything else.

NumDealsPurchases is negatively correlated with all the food categories other than MntWines and MntMeatProducts, and even those correlations are very low. Deals only show a promising positive correlation with NumWebVisitsMonth.

NumWebVisitsMonth has the most negative correlations across the board, the strongest being with Income (-0.65), followed by MntMeatProducts (-0.54), NumCatalogPurchases (-0.53), and then the rest of the food categories. It also correlates negatively with AcceptedCmp1 (-0.20) and AcceptedCmp5 (-0.28).

AcceptedCmp1 correlates most positively with AcceptedCmp5 (0.40).

AcceptedCmp2 correlates most positively with AcceptedCmp4 (0.30).

AcceptedCmp3 correlates most positively with AcceptedCmp1 (0.09) and is the only campaign with a negative correlation to another campaign, in this case AcceptedCmp4 (-0.08).

Campaign correlations will be key when examining our clusters, so that we can make the appropriate recommendations to the marketing department.
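Rather than reading the heatmap by eye, the strongest pairs can be pulled out programmatically from the correlation matrix; a sketch on a toy frame (column names borrowed from the dataset, values synthetic):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
income = rng.normal(50000, 15000, 500)
df = pd.DataFrame({
    "Income": income,
    "MntWines": income * 0.005 + rng.normal(0, 20, 500),  # strongly tied to Income
    "Recency": rng.integers(0, 100, 500),                 # unrelated noise
})

corr = df.corr()
# Keep only the upper triangle (each pair once), flatten, and sort by magnitude
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs.abs().sort_values(ascending=False).head(1).index[0])  # ('Income', 'MntWines')
```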

The above correlation heatmap only shows the relationship between numerical variables. Let's check the relationship of numerical variables with categorical variables.

Education Vs Income¶

In [2577]:
sns.barplot(x='Education', y='Income', data=data)
plt.show()

Observations and Insights:¶

Customers with a PhD have the highest income, while those with Basic education have the lowest. 50.4% have a Graduation-level education; the rest are almost equally divided, with 25.6% holding a Master's and 21.6% a PhD, and the remaining 2.4% at Basic. This tells us that slightly less than a quarter of the customers are in the highest-income group.

Marital Status Vs Income¶

In [2578]:
# Plot the bar plot for Marital_Status and Income

sns.barplot(x='Marital_Status', y='Income', data=data)
plt.show()

Observations and Insights:¶

Marital status does not have a great impact on income, though widows' income is slightly higher than the other categories. Widows also make up only 3.4% of the customers.
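The bar heights above are group means, which can also be read as a table via `groupby`; a sketch on toy data (the income values are made up for illustration):

```python
import pandas as pd

# Toy frame: income by marital status (illustrative values only)
df = pd.DataFrame({
    "Marital_Status": ["Married", "Married", "Widow", "Single"],
    "Income": [50000.0, 54000.0, 60000.0, 48000.0],
})

# Mean income per group; idxmax() names the highest-earning group
mean_income = df.groupby("Marital_Status")["Income"].mean()
print(mean_income.idxmax())  # Widow
```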

Kidhome Vs Income¶

In [2579]:
# Plot the bar plot for Kidhome and Income
sns.barplot(x='Kidhome', y='Income', data=data)
plt.show()

Observations and Insights: _¶

Customers with no children have a significantly higher income than those who do have children.

We can also visualize the relationship between two categorical variables.

Marital_Status Vs Kidhome¶

In [2580]:
# Plot the bar plot for Marital_Status and Kidhome
pd.crosstab(data['Marital_Status'], data['Kidhome']).plot(kind='bar',stacked=False)
Out[2580]:
<AxesSubplot:xlabel='Marital_Status'>

Observations and Insights: _¶

Overall, married individuals have more children in total than any other group: they have the highest count for every Kidhome value (0, 1, and 2). This makes sense because married people make up 38.6% of the data, the largest single group.

Single and Together have relatively close counts for having 1 child in the home, with Together slightly higher. The Together group makes up 25.9% of the data and the Single group 21.7%.

There are no widows who have 2 children. Most have 0, and a very small proportion have 1.

The Divorced group, the remaining 10.4% of the data, is also small, and its count of households with children is on the low side.
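The grouped bar plot above is built from a `pd.crosstab`. Here is a minimal sketch of the same idea on a toy sample (rows are invented, not from this dataset):

```python
import pandas as pd

# Hypothetical sample cross-tabulating marital status against children at home
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Married", "Single", "Widow", "Married"],
    "Kidhome":        [0, 1, 1, 0, 2],
})

# Counts of each (Marital_Status, Kidhome) combination
ct = pd.crosstab(toy["Marital_Status"], toy["Kidhome"])
```

Each cell of `ct` is the count plotted as one bar, so statements like "no widows have 2 children" correspond to a zero in the table.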

Feature Engineering and Data Processing¶

In this section, we will first prepare our dataset for analysis.

  • Creating new columns
  • Imputing missing values

Think About It:

  • The Year_Birth column in its current format might not be very useful in our analysis, since it contains only the customer's year of birth. Can we extract the age of each customer?
  • Are there other columns which can be used to create new features?

Age¶

In [2581]:
# Year_Birth holds just the year, so subtracting it from 2016 gives the age of the customer at the time of data collection in 2016

data["Age"] = 2016 - pd.to_datetime(data['Year_Birth'], format="%Y").dt.year

# Sorting the values in ascending order 
data["Age"].sort_values()                                         
Out[2581]:
1170     20
46       20
696      21
747      21
1850     21
       ... 
424      75
1950     76
192     116
339     117
239     123
Name: Age, Length: 2232, dtype: int64

Observations and Insights: _¶

Basing age on birth year, the youngest individual in our data is 20 and the oldest is 123. We have 3 individuals over the age of 115, which looks like a data-entry mistake, or possibly these were customers at the time the data was collected but are no longer living.

Think About It:

  • We could observe from the above output that there are customers with an age greater than 115. Can this be true or a data anomaly? Can we drop these observations?
In [2582]:
# Drop the observations with age > 115
# Hint: Use drop() method with inplace=True

data.drop(data[data["Age"] > 115].index, inplace=True)

data["Age"].sort_values() 
Out[2582]:
1170    20
46      20
2213    21
696     21
995     21
        ..
39      73
358     73
894     73
424     75
1950    76
Name: Age, Length: 2229, dtype: int64

Here we can see that the ages that were greater than 115 have now been dropped.
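The age calculation and outlier drop above can be sketched end-to-end on a tiny toy frame (birth years invented for illustration):

```python
import pandas as pd

# Minimal sketch of the age calculation and outlier drop on toy data
toy = pd.DataFrame({"Year_Birth": [1990, 1980, 1899]})
toy["Age"] = 2016 - toy["Year_Birth"]

# Drop implausible ages (> 115), mirroring the step above
toy.drop(toy[toy["Age"] > 115].index, inplace=True)
```

The 1899 row (age 117) is removed, leaving only plausible ages, just as the three >115 records were removed from `data`.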

Now, let's check the distribution of age in the data.

In [2583]:
# Plot histogram to check the distribution of age


plt.figure(figsize=(15, 7))
sns.histplot(x='Age', data=data)
plt.show()

Observations and Insights: _¶

The distribution of age is roughly bimodal, peaking around 40 and 45 with another rise at 60. With more data this would likely smooth into a bell curve or a cleaner bimodal shape.

Our highest count is at 45 years old.

We are not seeing skewness to the right or left.

Kids¶

  • Let's create feature "Kids" indicating the total kids and teens in the home.
In [2584]:
# Add Kidhome and Teenhome variables to create the new feature called "Kids"
data["Kids"] = data['Kidhome'] + data['Teenhome'] 
In [2585]:
perc_on_bar('Kids')

Observations:

By combining kids and teens into the new "Kids" feature, we have created a new option of 3 children in the home instead of the previous maximum of 2.

Family Size¶

  • Let's create a new variable called 'Family Size' to find out how many members each family has.
  • For this, we need to have a look at the Marital_Status variable and see what the categories are.
In [2586]:
# Check the unique categories in Marital_Status
data['Marital_Status'].unique().tolist()
Out[2586]:
['Single', 'Together', 'Married', 'Divorced', 'Widow']
  • We can combine the sub-categories Single, Divorced, Widow as "Single" and we can combine the sub-categories Married and Together as "Relationship"
  • Then we can create a new variable called "Status" and assign values 1 and 2 to categories Single and Relationship, respectively.
  • Then, we can use the Kids (calculated above) and the Status column to find the family size.
In [2587]:
# Replace "Married" and "Together" with "Relationship"

data['Marital_Status'].replace(["Married", "Together"], "Relationship", inplace=True)
In [2588]:
# Replace "Divorced" and "Widow" with "Single"

data['Marital_Status'].replace(["Divorced", "Widow"], "Single", inplace=True)
In [2589]:
#checking our new sub-categories

data['Marital_Status'].unique().tolist()
Out[2589]:
['Single', 'Relationship']
In [2590]:
# Create a new feature called "Status" by replacing "Single" with 1 and "Relationship" with 2 in Marital_Status
data["Status"] = data['Marital_Status'].replace({"Single": 1, "Relationship": 2}) 
In [2591]:
# Add two variables Status and Kids to get the total number of persons in each family
data["Family_Size"] = data["Status"] + data["Kids"]

perc_on_bar('Family_Size')
In [2592]:
data['Family_Size'].describe()
Out[2592]:
count    2229.000000
mean        2.596231
std         0.907432
min         1.000000
25%         2.000000
50%         3.000000
75%         3.000000
max         5.000000
Name: Family_Size, dtype: float64
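The whole Status/Family_Size construction above can be condensed into a few lines. A minimal sketch on a toy frame (rows invented for illustration):

```python
import pandas as pd

# Sketch of the Status / Family_Size construction on a toy frame
toy = pd.DataFrame({
    "Marital_Status": ["Single", "Married", "Widow", "Together"],
    "Kids":           [0, 2, 1, 0],
})

# Collapse the sub-categories, then map to a numeric count of adults in the household
toy["Marital_Status"] = toy["Marital_Status"].replace(
    {"Married": "Relationship", "Together": "Relationship",
     "Divorced": "Single", "Widow": "Single"})
toy["Status"] = toy["Marital_Status"].replace({"Single": 1, "Relationship": 2})
toy["Family_Size"] = toy["Status"] + toy["Kids"]
```

Status is effectively the number of adults (1 or 2), so Family_Size = adults + children, which matches the min of 1 and max of 5 seen in the `describe()` output.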
In [2593]:
plt.figure(figsize=(5, 5))
sns.histplot(x="Family_Size", data=data, bins=5)
Out[2593]:
<AxesSubplot:xlabel='Family_Size', ylabel='Count'>
In [2594]:
sns.barplot(x='Status', y='Income', data=data)
plt.show()

Expenses¶

  • Let's create a new feature called "Expenses", indicating the total amount spent by the customers in various products over the span of two years.
In [2595]:
# Create a new feature
# Add the amount spent on the food products 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', and 'MntSweetProducts' (note: MntGoldProds is not included in this total)
data["Expenses"] = data['MntWines'] + data['MntFruits'] + data['MntMeatProducts']+ data['MntFishProducts'] + data['MntSweetProducts']

data["Expenses"].describe()
Out[2595]:
count    2229.000000
mean      561.427995
std       575.375816
min         4.000000
25%        56.000000
50%       342.000000
75%       962.000000
max      2491.000000
Name: Expenses, dtype: float64
In [2596]:
sns.barplot(x='Expenses', y='Income', data=data)
plt.show()

Total Purchases¶

  • Let's create a new feature called "NumTotalPurchases", indicating the total number of products purchased by the customers.
In [2597]:
# Create a new feature
# Add the number of purchases from each channel 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'
data["NumTotalPurchases"] = data['NumDealsPurchases'] + data['NumWebPurchases'] + data['NumCatalogPurchases'] + data['NumStorePurchases']

data["NumTotalPurchases"].describe()
Out[2597]:
count    2229.000000
mean       14.869000
std         7.622187
min         0.000000
25%         8.000000
50%        15.000000
75%        21.000000
max        43.000000
Name: NumTotalPurchases, dtype: float64

Engaged in Days¶

  • Let's create a new feature called "Engaged in days", indicating how long the customer has been with the company.
In [2598]:
# Converting the Dt_Customer variable to a Python datetime object
data["Dt_Customer"] = pd.to_datetime(data['Dt_Customer'])

Let's check the max and min of the date.

In [2599]:
# Check the minimum of the date
# Hint: Use the min() method

data["Dt_Customer"].min()
Out[2599]:
Timestamp('2012-01-08 00:00:00')
In [2600]:
# Check the maximum of the date
# Hint: Use the max() method

data["Dt_Customer"].max()
Out[2600]:
Timestamp('2014-12-06 00:00:00')

Think About It:

  • From the above output from the max function, we observed that the last customer enrollment date is December 6th, 2014. Can we extract the number of days a customer has been with the company using some date as the threshold? Can January 1st, 2015 be that threshold?
In [2601]:
 # Assigning date to the day variable
data["day"] = "01-01-2015"                         

# Converting the variable day to Python datetime object
data["day"] = pd.to_datetime(data.day)              
In [2602]:
data["Engaged_in_days"] = (data["day"] - data["Dt_Customer"]).dt.days     
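The tenure calculation above is just a datetime subtraction. A minimal self-contained sketch with two toy enrollment dates (the min and max observed above):

```python
import pandas as pd

# Toy sketch of the tenure calculation: days between enrollment and a fixed cutoff date
enrol = pd.to_datetime(pd.Series(["2014-12-06", "2012-01-08"]))
cutoff = pd.to_datetime("2015-01-01")

# Subtracting two datetime Series yields a timedelta; .dt.days extracts whole days
engaged_in_days = (cutoff - enrol).dt.days
```

So the most recent enrollee has been engaged for 26 days and the earliest for 1089 days, which bounds the Engaged_in_days feature.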

TotalAcceptedCmp¶

  • Let's create a new feature called "TotalAcceptedCmp" that shows how many offers customers have accepted.
In [2603]:
# Add all the campaign related variables to get the total number of accepted campaigns by a customer
# "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response"
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"] + data["Response"]

data["TotalAcceptedCmp"].describe()
Out[2603]:
count    2229.000000
mean        0.448183
std         0.890118
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         5.000000
Name: TotalAcceptedCmp, dtype: float64
In [2604]:
print(sns.barplot(x='TotalAcceptedCmp', y='Income', data=data))
AxesSubplot(0.125,0.11;0.775x0.77)
In [2605]:
# Plot the histogram of TotalAcceptedCmp
plt.figure(figsize=(15, 7))
sns.histplot(x='TotalAcceptedCmp', data=data)

plt.title('Distribution of Total Accepted Campaigns')
plt.xlabel('Campaigns Accepted')
plt.ylabel('Count')
Out[2605]:
Text(0, 0.5, 'Count')
In [2606]:
sns.boxplot(data=data, x='TotalAcceptedCmp', showmeans=True, color="violet")
Out[2606]:
<AxesSubplot:xlabel='TotalAcceptedCmp'>

AmountPerPurchase¶

  • Let's create a new feature called "AmountPerPurchase" indicating the amount spent per purchase.
In [2607]:
# Divide the "Expenses" by "NumTotalPurchases" to create the new feature AmountPerPurchase 
data["AmountPerPurchase"] = data["Expenses"] / data["NumTotalPurchases"]

data["AmountPerPurchase"].describe()
Out[2607]:
count    2229.000000
mean             inf
std              NaN
min         0.466667
25%         7.800000
50%        20.764706
75%        41.529412
max              inf
Name: AmountPerPurchase, dtype: float64

Now, let's check the maximum value of the AmountPerPurchase.

In [2608]:
# Check the max value
# Hint: Use max() function

data['AmountPerPurchase'].max()
Out[2608]:
inf

Think About It:

  • Is the maximum value in the above output valid? What could be the potential reason for such output?
  • How many such values are there? Can we drop such observations?
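As a small aside before the cells below, here is a sketch (toy values only) of why the `inf` appears and how dropping the zero-purchase rows fixes it:

```python
import numpy as np
import pandas as pd

# Sketch: dividing by zero total purchases yields inf, which then pollutes describe()
toy = pd.DataFrame({"Expenses": [100, 50], "NumTotalPurchases": [4, 0]})
toy["AmountPerPurchase"] = toy["Expenses"] / toy["NumTotalPurchases"]

n_inf = int(np.isinf(toy["AmountPerPurchase"]).sum())

# Dropping the zero-purchase rows removes the inf values
clean = toy[toy["NumTotalPurchases"] > 0]
```

Pandas follows NumPy semantics, so `50 / 0` becomes `inf` rather than raising an error, which is why the `max()` above returned `inf` instead of failing loudly.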
In [2609]:
# Find how many observations have NumTotalPurchases equal to 0

(data['NumTotalPurchases'] == 0).sum()
Out[2609]:
2
In [2610]:
# Comparing the sorted series to 0 puts the zero-purchase rows (True) at the top
data['NumTotalPurchases'].sort_values() == 0
Out[2610]:
981      True
1524     True
2214    False
774     False
2228    False
        ...  
627     False
1252    False
412     False
432     False
21      False
Name: NumTotalPurchases, Length: 2229, dtype: bool
In [2611]:
# Drop the observations with NumTotalPurchases equal to 0, using their indices

data.drop(data[data['NumTotalPurchases'] == 0].index, inplace=True)
In [2612]:
data['NumTotalPurchases'].describe()
Out[2612]:
count    2227.000000
mean       14.882353
std         7.612563
min         1.000000
25%         8.000000
50%        15.000000
75%        21.000000
max        43.000000
Name: NumTotalPurchases, dtype: float64

Now, let's check the distribution of values in AmountPerPurchase column.

In [2613]:
# Check the summary statistics of the AmountPerPurchase variable 
data['AmountPerPurchase'].describe()
Out[2613]:
count    2227.000000
mean       30.540586
std        44.017468
min         0.466667
25%         7.800000
50%        20.736842
75%        41.325630
max      1657.000000
Name: AmountPerPurchase, dtype: float64
In [2614]:
# Plot the histogram for the AmountPerPurchase variable

plt.figure(figsize=(15, 7))
sns.histplot(x="AmountPerPurchase", data=data)


plt.title('Distribution of Amount Per Purchase')
plt.xlabel('Dollar Amount')
plt.ylabel('Count')

plt.show()

Observations and Insights: _¶

We have one extreme outlier; the vast majority of purchases are for a much lower dollar amount.

Imputing Missing Values¶

In [2615]:
# Impute the missing values for the Income variable with the median

data['Income'] = data['Income'].fillna(data['Income'].median())
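Median imputation in one line, sketched on a toy Income column (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Sketch of median imputation on a toy Income column
toy = pd.DataFrame({"Income": [30000.0, np.nan, 50000.0, 40000.0]})

# The median ignores NaN, so the missing entry is filled with the median of the observed values
toy["Income"] = toy["Income"].fillna(toy["Income"].median())
```

The median is preferred over the mean here because Income is typically right-skewed, so the mean would be pulled up by high earners.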

Now that we are done with data preprocessing, let's visualize new features against the new income variable we have after imputing missing values.

Income Vs Expenses¶

In [2616]:
# Plot the scatter plot with Expenses on Y-axis and Income on X-axis  

x= data['Income']
y= data['Expenses']

plt.figure(figsize=(20, 10))                                    # Setting the plot size

sns.scatterplot(x=x, y=y)                  # Hint: Use sns.scatterplot()

plt.xticks(fontsize=16)                                         # Font size of X-label

plt.yticks(fontsize=16)                                         # Font size of Y-label

plt.xlabel("Income", fontsize=20, labelpad=20)                  # Title of X-axis

plt.ylabel("Expenses", fontsize=20, labelpad=20)                # Title of Y-axis
Out[2616]:
Text(0, 0.5, 'Expenses')

Observations and Insights: _¶

We can see a positive relationship between income and expenses that is exponential in nature. A linear model would not fit this data well.

Family Size Vs Income¶

In [2617]:
# Plot the bar plot for Family Size on X-axis and Income on Y-axis

plt.figure(figsize = (15,5))
sns.barplot(data = data, x = 'Family_Size', y = 'Income', ci=False)
plt.show()

Observations and Insights: _¶

Customers whose family size is only 1 have the highest income, followed by those with a family size of 2. In all likelihood these are single adults and couples with no children.

The family size with the lowest income is those with 4 people in their family followed closely by those with 3 people and then those with 5 people.

The most significant thing separating these groups seems to be whether or not they have children.

Important Insights from EDA and Data Preprocessing¶

What are the most important observations and insights from the data based on the EDA and Data Preprocessing performed?

Preparing Data for Segmentation¶

Dropping columns that we will not use for segmentation¶

The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.

Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.

Here, we will use the behavioral attributes for segmentation and drop the demographic attributes like Income, Age, and Family_Size. In addition to this, we need to drop some other columns which are mentioned below.

  • Dt_Customer: We have created the Engaged_in_days variable using the Dt_Customer variable. Hence, we can drop this variable as it will not help with segmentation.
  • Complain: About 95% of the customers didn't complain and have the same value for this column. This variable will not have a major impact on segmentation. Hence, we can drop this variable.
  • day: We have created the Engaged_in_days variable using the 'day' variable. Hence, we can drop this variable as it will not help with segmentation.
  • Status: This column was created just to get the Family_Size variable that contains the information about the Status. Hence, we can drop this variable.
  • We also need to drop categorical variables like Education and Marital_Status, Kids, Kidhome, and Teenhome as distance-based algorithms cannot use the default distance like Euclidean to find the distance between categorical and numerical variables.
  • We can also drop the variables AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, and Response, for which we have created the aggregate variable TotalAcceptedCmp.
In [2618]:
# Dropping all the irrelevant columns and storing in data_model
data_model = data.drop(
    columns=[
        "Year_Birth",
        "Dt_Customer",
        "day",
        "Complain",
        "Response",
        "AcceptedCmp1",
        "AcceptedCmp2",
        "AcceptedCmp3",
        "AcceptedCmp4",
        "AcceptedCmp5",
        "Marital_Status",
        "Status",
        "Kids",
        'Education',
        'Kidhome',
        'Teenhome', 'Income','Age', 'Family_Size'
    ],
    axis=1,
)
In [2619]:
# Check the shape of new data 
data_model.shape
Out[2619]:
(2227, 17)
In [2620]:
# Check first five rows of new data
data_model.head()
Out[2620]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 58 635 88 546 172 88 88 3 8 10 4 7 1529 25 997 1 61.160000
1 38 11 1 6 2 1 6 2 1 1 2 5 21 6 151 0 3.500000
2 26 426 49 127 111 21 42 1 8 2 10 4 734 21 498 0 34.952381
3 26 11 4 20 10 3 5 2 2 0 4 6 48 8 91 0 6.000000
4 94 173 43 118 46 27 15 5 5 3 6 5 407 19 347 0 21.421053

Let's plot the correlation plot after we've removed the irrelevant variables.

In [2621]:
# Plot the correlation plot for new data
plt.figure(figsize = (15, 15))

sns.heatmap(data_model.corr(), annot = True, fmt = ".2f")

plt.show()

Observations and Insights:

We can observe one truly high positive correlation, between Expenses and MntWines at 0.90.

We have good correlation between NumTotalPurchases and MntWines (0.72), NumWebPurchases (0.79), NumCatalogPurchases (0.74), NumStorePurchases (0.83), and Expenses (0.74). Customers with a high total number of purchases buy through all the different channels, with in-store purchases showing the strongest link at 0.83.

There is also good correlation between Expenses and MntMeatProducts (0.86), NumCatalogPurchases (0.79), and NumTotalPurchases (0.74).

Good correlation between MntMeatProducts and NumCatalogPurchases (0.70).

Weak positive correlation exists between many pairs, namely AmountPerPurchase with MntWines and MntMeatProducts, and NumWebPurchases with MntWines.

With the advertising campaigns in mind, we can see that the highest correlation for TotalAcceptedCmp (though still considered weak) is with MntWines (0.49). There is also some weak correlation between TotalAcceptedCmp and MntMeatProducts (0.34), NumCatalogPurchases (0.37), and Expenses (0.46).

The highest correlation for NumDealPurchases is with NumWebVisitsMonth (0.37) which is interesting to note from a marketing perspective although this correlation is still weak.

Good negative correlation exists between NumWebVisitsMonth and MntMeatProducts (-0.54), NumCatalogPurchases, and Expenses.

We have weak negative correlation between NumWebVisitsMonth and MntWines, MntFruits, MntFishProducts, MntSweetProducts, NumStorePurchases, NumTotalPurchases, and AmountPerPurchase.

Observing the above correlation matrix, we can see that high combined expenses are closely related to wine and meat purchases. This makes sense because, of all the products sold, these two sub-categories are the most costly. Wine purchases hold the most correlation with every other category of interest across the board. People who buy wine shop more often, visit the website more, spend more, accept more campaigns, and are more likely to make a purchase with a deal than any other food category (although this is still a very weak relationship).

Higher total purchase counts are fairly evenly distributed between the website, catalog, and store, with store being the highest.

Interestingly, people who visit the website more often in a month are less likely to purchase anything: they have a lower number of total purchases and store purchases, and spend less per purchase in every food category. These people may be spending more time looking for the best deal. This can be observed by noting that they have negative correlation with every category other than NumDealsPurchases and Engaged_in_days, where they have their highest correlations. This tells us that the people who have been customers the longest are also waiting for the best deals by checking the website regularly.

Scaling the Data¶

What is feature scaling?

Feature scaling is a class of statistical techniques that, as the name implies, scales the features of our data so that they all have a similar range. You'll understand better if we look at an example:

If you have multiple independent variables like Age, Income, and Amount related variables, with their range as (18–100 Years), (25K–75K), and (100–200), respectively, feature scaling would help them all to be in the same range.

Why is feature scaling important in unsupervised learning?

Feature scaling is especially relevant in machine learning models that compute some sort of distance metric as we do in most clustering algorithms, for example, K-Means.

So, scaling should be done to avoid the problem of one feature dominating over others because the unsupervised learning algorithm uses distance to find the similarity between data points.

Let's scale the data

Standard Scaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.

  1. Data standardization is the process of rescaling the attributes so that they have a mean of 0 and a variance of 1.
  2. The ultimate goal to perform standardization is to bring down all the features to a common scale without distorting the differences in the range of the values.
  3. In sklearn.preprocessing.StandardScaler(), centering and scaling happen independently on each feature.
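The three points above can be verified directly: after `StandardScaler`, each column has mean 0 and unit variance regardless of its original range. A minimal sketch with toy Age/Income values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy check: columns on very different scales (Age vs Income)
X = np.array([[18.0, 25000.0],
              [40.0, 50000.0],
              [100.0, 75000.0]])

scaled = StandardScaler().fit_transform(X)

# Centering and scaling happen independently per column
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
```

Note that `StandardScaler` uses the population standard deviation (ddof=0), which matches NumPy's default `std`.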
In [2622]:
# Applying standard scaler on new data
scaler = StandardScaler()                                                   # Initialize the Standard Scaler

df_scaled = scaler.fit_transform(data_model)                # fit_transform the scaler function on new data

df_scaled = pd.DataFrame(df_scaled, columns=data_model.columns)      # Converting the embeddings to a dataframe

df_scaled.head()
Out[2622]:
Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
0 0.306906 0.979274 1.549793 1.735137 2.455586 1.471064 0.841828 0.357919 1.404892 2.633008 -0.561330 0.696875 1.681096 1.329371 1.975678 0.619416 0.695776
1 -0.384035 -0.873925 -0.638021 -0.726901 -0.652191 -0.633425 -0.732464 -0.169952 -1.119121 -0.586255 -1.178960 -0.135935 -0.940315 -1.167064 -1.667464 -0.503906 -0.614453
2 -0.798600 0.358572 0.569049 -0.175222 1.340442 -0.149634 -0.041311 -0.697824 1.404892 -0.228559 1.291559 -0.552339 0.299119 0.803806 -0.173173 -0.503906 0.100251
3 -0.798600 -0.873925 -0.562579 -0.663070 -0.505942 -0.585045 -0.751662 -0.169952 -0.758548 -0.943951 -0.561330 0.280470 -0.893380 -0.904281 -1.925843 -0.503906 -0.557644
4 1.550599 -0.392806 0.418165 -0.216256 0.152175 -0.004497 -0.559676 1.413662 0.323172 0.129137 0.056299 -0.135935 -0.269317 0.541023 -0.823427 -0.503906 -0.207226

Applying T-SNE and PCA to the data to visualize the data distributed in 2 dimensions¶

Applying T-SNE¶

In [2623]:
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed

tsne = TSNE(n_components = 2, random_state = 1, perplexity = 35)        # Initializing T-SNE with 2 components, random_state=1, and perplexity=35

data_tsne = tsne.fit_transform(data_model)                              # fit_transform T-SNE on the new data

data_tsne = pd.DataFrame(data_tsne, columns=[0, 1])                     # Converting the embeddings to a dataframe

plt.figure(figsize=(7, 7))                                              # Setting the plot size
sns.scatterplot(x=data_tsne.iloc[:, 0], y=data_tsne.iloc[:, 1])         # Scatter plot of the two T-SNE components
Out[2623]:
<AxesSubplot:xlabel='0', ylabel='1'>

Observation and Insights:

T-SNE did not separate the data well with just 2 dimensions; we are not able to see clear clusters here. This is likely due to the strong multicollinearity between the features, which also affects distance-based clustering methods.

Applying PCA¶

Think about it:

  • Should we apply clustering algorithms on the current data or should we apply PCA on the data before applying clustering algorithms? How would this help?

When the variables used in clustering are highly correlated, it causes multicollinearity, which affects the clustering method and results in poor cluster profiling (or biased toward a few variables). PCA can be used to reduce the multicollinearity between the variables.
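The decorrelation claim can be demonstrated on synthetic data: two nearly identical features go in, and the principal components that come out are uncorrelated. A sketch (random toy data, not the customer dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Two almost perfectly correlated features
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.05 * rng.normal(size=200)])

orig_corr = np.corrcoef(X.T)[0, 1]

# Principal components are projections onto orthogonal directions,
# so their sample correlation is zero up to numerical precision
pcs = PCA(n_components=2, random_state=1).fit_transform(X)
pc_corr = np.corrcoef(pcs.T)[0, 1]
```

This is why clustering on `data_pca` rather than on the raw scaled features reduces the bias toward groups of redundant variables.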

In [2624]:
# Defining the number of principal components to generate
n = data_model.shape[1]                                        # Storing the number of variables in the data

pca = PCA(n_components = n, random_state = 1)            # Initialize PCA with n_components = n and random_state=1

data_pca = pd.DataFrame(pca.fit_transform(df_scaled))      # fit_transform PCA on the scaled data

# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_                     

Let's plot the first two components and see how the data points are distributed.

In [2625]:
# Scatter plot for two components using the dataframe data_pca

plt.figure(figsize = (7, 7))

sns.scatterplot(x = data_pca[0], y = data_pca[1])

plt.xlabel("PC1")

plt.ylabel("PC2")

plt.show()

Let's apply clustering algorithms on the data generated after applying PCA

K-Means¶

In [2626]:
distortions = []                                                  # Create an empty list

K = range(2, 10)                                                  # Setting the K range from 2 to 10

for k in K:
    kmeanModel = KMeans(n_clusters=k,random_state=4)              # Initialize K-Means
    kmeanModel.fit(data_pca)                                      # Fit K-Means on the data
    distortions.append(kmeanModel.inertia_)                       # Append distortion values to the empty list created above
In [2627]:
# Plotting the elbow plot
plt.figure(figsize=(16, 8))                                            # Setting the plot size

plt.plot(K, distortions, "bx-")                                        # Plotting the K on X-axis and distortions on y-axis

plt.xlabel("k")                                                        # Title of x-axis

plt.ylabel("Distortion")                                               # Title of y-axis

plt.title("The Elbow Method showing the optimal k")                    # Title of the plot
plt.show()

In the above plot, elbows are seen at K=3 and K=5, where there is a noticeable drop in distortion.

Think About It:

  • How do we determine the optimal K value when the elbows are observed at 2 or more K values from the elbow curve?
  • Which metric can be used to determine the final K value?

We can use the silhouette score as a metric for different K values to make a better decision about picking the number of clusters (K).

What is the silhouette score?¶

Silhouette score is one of the methods for evaluating the quality of clusters created using clustering algorithms such as K-Means. The silhouette score is a measure of how similar an object is to its cluster (cohesion) compared to other clusters (separation). Silhouette score has a range of [-1, 1].

  • Silhouette coefficients near +1 indicate that the clusters are dense and well separated, which is good.
  • Silhouette score near -1 indicates that those samples might have been assigned to the wrong cluster.
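To build intuition for the scores computed below, here is a self-contained sketch on two well-separated synthetic blobs (toy data, not the customer dataset), where the silhouette score should be close to +1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Two well-separated toy blobs in 2-D
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])

labels = KMeans(n_clusters=2, random_state=1, n_init=10).fit_predict(X)
score = silhouette_score(X, labels)
```

Real customer data is far messier than this, which is why the scores below sit around 0.12-0.27 rather than near 1.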

Finding silhouette score for each value of K

In [2628]:
sil_score = []                                                             # Creating an empty list
cluster_list = range(3, 7)                                                 # Creating a range from 3 to 7
for n_clusters in cluster_list:

    # Initialize K-Means with number of clusters equal to n_clusters and random_state=1
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)

    # Fit and predict on the PCA data
    preds = clusterer.fit_predict(data_pca)

    # Calculate the silhouette score - Hint: Use the silhouette_score() function
    score = silhouette_score(data_pca, preds)

    # Append the silhouette score (not preds) to the list created above
    sil_score.append(score)

    # Print the silhouette score
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 3, the silhouette score is 0.2718010164963816
For n_clusters = 4, the silhouette score is 0.25228159218826157
For n_clusters = 5, the silhouette score is 0.23855028311006854
For n_clusters = 6, the silhouette score is 0.1219241977999894

From the above silhouette scores, 3 appears to be a good value of K. So, let's build K-Means using K=3.

Applying K-Means on data_pca¶

In [2629]:
kmeans = KMeans(n_clusters=3, random_state = 1)       # Initialize the K-Means algorithm with 3 clusters and random_state=1

kmeans.fit(data_pca)                                     # Fitting on the data_pca
Out[2629]:
KMeans(n_clusters=3, random_state=1)
In [2630]:
data_pca["K_means_segments_3"] = kmeans.labels_          # Adding K-Means cluster labels to the data_pca data

data["K_means_segments_3"] = kmeans.labels_               # Adding K-Means cluster labels to the whole data

data_model["K_means_segments_3"] = kmeans.labels_         # Adding K-Means cluster labels to data_model
In [2631]:
# Let's check the distribution
data_model["K_means_segments_3"].value_counts()
Out[2631]:
1    1059
0     607
2     561
Name: K_means_segments_3, dtype: int64

**Let's visualize the clusters using PCA**

In [2632]:
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=1, data=PCA, hue=cluster)
In [2633]:
PCA_PLOT(0, 1, data_pca, "K_means_segments_3")

Observations and Insights:

Cluster sizes are 607 for cluster 0, 1059 for cluster 1, and 561 for cluster 2. The clusters have clear boundaries and are well separated.

Cluster Profiling¶

In [2634]:
# Taking the cluster-wise mean of all the variables. Hint: First groupby 'data' by 'K_means_segments_3' and then find mean
cluster_profile_KMeans_3 = data.groupby("K_means_segments_3").mean(numeric_only=True)
In [2635]:
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_3.style.highlight_max(color="lightgreen", axis=0)
Out[2635]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_means_segments_3                                                                
0 1965.861614 57700.039539 0.278418 0.846787 47.731466 455.062603 23.329489 138.293245 30.607908 24.774300 62.331137 3.795717 6.487644 3.102142 7.906096 5.739703 0.070840 0.123558 0.023064 0.036244 0.016474 0.008237 0.133443 50.138386 1.125206 1.658979 2.784185 672.067545 21.291598 592.602965 0.403624 31.344661
1 1971.018886 35319.806421 0.755430 0.474032 49.210576 45.411709 5.047214 23.671388 7.395656 5.148253 15.574127 2.021719 2.146364 0.579792 3.288952 6.379603 0.068933 0.015109 0.000000 0.000944 0.001889 0.010387 0.084986 44.981114 1.229462 1.650614 2.880076 86.674221 8.036827 500.758263 0.171860 9.412071
2 1968.142602 75975.609626 0.037433 0.204991 50.436720 633.704100 69.916221 462.395722 102.483066 71.395722 78.427807 1.294118 5.219251 6.024955 8.331551 2.891266 0.083779 0.135472 0.263815 0.215686 0.030303 0.007130 0.290553 47.857398 0.242424 1.618538 1.860963 1339.894831 20.869875 550.069519 1.019608 69.554890

Observations and Insights:

Our clusters are divided by income into high, medium, and low groups. We also see distinct high/medium/low divisions for Expenses and AmountPerPurchase.

Let us create a boxplot for each of the variables

In [2636]:
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
In [2637]:
# Creating boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_3'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster:¶

Cluster 0: Middle Income

Summary for cluster 0: This cluster has 607 individuals and represents the middle group. They have the second-highest income, an average of 1 kid at home with an average family size of 3, and sit in the middle for most purchase categories. However, they use more deals than any other group, shop on the web the most, are the oldest group, make the highest number of purchases, and have been customers for the longest time. They are the middle group for web visits, store purchases, and catalog purchases, but they have the highest number of web purchases. They are second highest for accepting campaigns and for amount per purchase. Recency is the only category in which they are lowest. The average age for this cluster is 50 years.

This cluster shops online the most and takes advantage of more deals than any other group. They also shop the most frequently and have been customers the longest. Their income and family size suggest they most closely resemble the middle class.

Cluster 1: Low Income

Summary for cluster 1: This cluster has 1059 individuals and is the group with the lowest income; they are the youngest and have the most kids at home. They have spent the least on every food category by a considerable margin and make the fewest purchases. They have the lowest expenses and have accepted few campaigns, but they rank second in purchases made with deals. They also visit the website more than the other two clusters.

This group does not shop much with the company. They do a fair amount of window shopping, so to speak, by browsing the website, yet despite spending the most time there, they are still lowest in website purchases. This group is watching its spending, likely buying groceries elsewhere, and has been with the company for the shortest period of time.

Cluster 2: High Income

Summary for cluster 2: This cluster has 561 individuals and represents those with the highest income. They make the most purchases in every food category, use the fewest deals, and visit the website the least. They are highest for catalog and in-store purchases, as well as for accepting every campaign. They have the second-highest number of purchases, but their amount per purchase is vastly greater than any other group's, as are their expenses.

The high-income group is the most engaged cluster. They have responded to the most campaigns and spend the most money with the company. They also use the catalog more than any other group, preferring catalog and in-store purchases over visiting the website at all.

Think About It:

  • Are the K-Means profiles with K=3 providing any deep insights into customer purchasing behavior or which channels they are using?
  • What is the next step to get more meaningful insights?

We can see from the above profiles that K=3 segments the customers into High, Medium and Low-income customers, and we are not getting deep insights into different types of customers. So, let's try to build K=5 (which has another elbow in the Elbow curve) and see if we can get better cluster profiles.

In [2638]:
# Dropping labels we got from K=3 since we will be using PCA data for prediction
# Drop K_means_segments_3. Hint: Use axis=1 and inplace=True
data_pca.drop("K_means_segments_3", axis=1, inplace=True)
data.drop("K_means_segments_3", axis=1, inplace=True)

Let's build K-Means using K=5

In [2639]:
# Fit the K-Means algorithm with number of clusters as 5 and random_state=1 on data_pca
kmeans = KMeans(n_clusters=5, random_state=1)
       
kmeans.fit(data_pca)  
Out[2639]:
KMeans(n_clusters=5, random_state=1)
In [2640]:
# Add K-Means cluster labels to data_pca
data_pca["K_means_segments_5"] = kmeans.labels_

# Add K-Means cluster labels to the whole data
data["K_means_segments_5"] = kmeans.labels_

# Add K-Means cluster labels to data_model
data_model["K_means_segments_5"] = kmeans.labels_
In [2641]:
# Let's check the distribution
data_model["K_means_segments_5"].value_counts()
Out[2641]:
2    994
0    403
1    357
4    308
3    165
Name: K_means_segments_5, dtype: int64

Let's visualize the clusters using PCA

In [2642]:
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "K_means_segments_5")

Cluster Profiling¶

In [2643]:
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_KMeans_5 = data.groupby("K_means_segments_5").mean()
In [2644]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_5.style.highlight_max(color="lightgreen", axis=0)
Out[2644]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_means_segments_5                                                                
0 1965.585608 63397.915633 0.106700 0.751861 48.406948 509.126551 33.553350 182.409429 40.461538 32.322581 62.414392 2.404467 6.347395 3.811414 8.970223 4.565757 0.044665 0.099256 0.024814 0.029777 0.007444 0.014888 0.054591 50.414392 0.858561 1.645161 2.503722 797.873449 21.533499 544.935484 0.260546 37.399934
1 1967.568627 74565.285714 0.025210 0.204482 51.050420 508.112045 80.560224 467.282913 120.535014 80.686275 86.733894 1.366947 4.893557 6.014006 8.378151 2.579832 0.030812 0.028011 0.103641 0.109244 0.000000 0.000000 0.140056 48.431373 0.229692 1.627451 1.857143 1257.176471 20.652661 531.176471 0.411765 63.251772
2 1971.243461 34929.668008 0.754527 0.461771 49.263581 39.363179 4.852113 20.918511 6.972837 5.027163 14.565392 1.850101 2.010060 0.523139 3.230382 6.305835 0.063380 0.014085 0.000000 0.001006 0.002012 0.011066 0.078471 44.756539 1.216298 1.651911 2.868209 77.133803 7.613682 491.091549 0.158954 8.966401
3 1969.896970 79500.800000 0.048485 0.187879 48.557576 945.696970 49.745455 476.921212 72.006061 56.648485 66.472727 1.206061 5.921212 6.200000 8.030303 3.690909 0.236364 0.412121 0.690909 0.496970 0.133333 0.006061 0.715152 46.103030 0.236364 1.600000 1.836364 1601.018182 21.357576 596.436364 2.684848 87.930039
4 1966.600649 48840.996753 0.584416 0.860390 47.626623 318.444805 11.090909 92.857143 18.685065 14.181818 54.425325 5.441558 6.035714 2.113636 5.889610 7.220779 0.103896 0.113636 0.003247 0.032468 0.006494 0.006494 0.214286 49.399351 1.444805 1.665584 3.110390 455.259740 19.480519 658.461039 0.474026 22.531721

Let's plot the boxplot

In [2645]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_means_segments_5'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

Cluster 0: Middle Income with Kids and Oldest

Summary for cluster 0:

This group is made up of 403 of the oldest customers, with the middle-most average income. They have more teens at home and fewer small children. They are the middle group for all food purchases and have the second-lowest recency, shopping more recently than three other groups. They have the highest numbers of in-store purchases, web purchases, and total purchases, as well as the most complaints. Despite making the most purchases, they are the middle group for expenses and amount per purchase. They responded worst to campaign 2 and best to campaign 4.

Cluster 1: High Income with no Kids, Older

Summary for cluster 1: This group is made up of 357 customers with the second-highest income and the fewest kids at home. They have the highest recency and spend the most on fruit, fish, sweets, and gold products. They are second highest for catalog and in-store purchases, and second highest for expenses and purchases. They have the lowest number of website visits, responded best to campaigns 1 and 5, and worst to campaign 2.

Cluster 2: Low Income with Kids and Youngest

Summary for cluster 2:

This group is made up of 994 customers with the lowest income, the most kids in the house, and the youngest average age. They have the second-highest recency and the lowest spend on every food category. They are the middle group for deals purchases and the lowest for store, web, and catalog purchases. They are second highest for website visits per month, which, coupled with their low income and higher deals usage, suggests they mostly shop when there is a deal and check regularly for one. They are the newest customers, with the lowest expenses and purchases. They were most receptive to campaign 3 and least receptive to campaign 5. This would be a good segment for email marketing: they already check the website regularly looking for deals, and personalized ads would get them in the door more often.

Cluster 3: High Income with no Kids and Young

Summary for cluster 3:

This group is made up of 165 customers with the highest income and no children, and they have spent the most on wine and meat of all the groups. They have the highest number of catalog purchases and are the middle group for web and store purchases. This group has accepted more campaigns than any other. They have the second-highest expenses and the second-highest total number of purchases. They spend the most per purchase and rank second for how long they have been customers.

Cluster 4: Low Income with Kids and Older

Summary for cluster 4: This group is made up of 308 older, low-income customers with kids. They are second lowest in all purchases but rank highest for purchases made with deals. They are second highest for website purchases and second lowest for catalog and in-store purchases. They have the highest number of website visits per month and the largest family size. They responded best to campaign 4 and have been customers the longest of all groups.

In [2646]:
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("K_means_segments_5", axis=1, inplace=True)
data.drop("K_means_segments_5", axis=1, inplace=True)

From the above profiles, K=5 provides more interesting insights about customers' purchasing behavior and preferred purchase channels. We can also see that the high-, medium-, and low-income groups have different age profiles and preferences, which was not evident with K=3. So, we can choose K=5.

K-Medoids¶

Let's find the silhouette score for K=5 in K-Medoids

In [2647]:
kmedo = KMedoids(n_clusters = 5, random_state = 1)     # Initializing K-Medoids with number of clusters as 5 and random_state=1

preds = kmedo.fit_predict(data_pca)                    # Fit and predict K-Medoids using data_pca

score = silhouette_score(data_pca, preds)          # Calculate the silhouette score

print(score)                   # Print the score
0.10644004638344105

Observations and Insights:

Our silhouette score for K-Medoids is 0.106, which is lower than the 0.239 we obtained for K-Means with K=5.
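As a reminder of why the two algorithms can score differently: K-Means represents each cluster by the mean of its members, while K-Medoids uses an actual member (the medoid) that minimizes total distance to the rest, making it more robust to outliers. A pure-Python sketch on illustrative 1-D data (hypothetical values, not data_pca):

```python
# One illustrative 1-D cluster with an outlier
points = [1.0, 2.0, 3.0, 100.0]

# K-Means representative: the mean, pulled far toward the outlier
centroid = sum(points) / len(points)

# K-Medoids representative: the member minimizing total distance to the others,
# which stays inside the dense part of the cluster
medoid = min(points, key=lambda m: sum(abs(m - p) for p in points))

print(centroid, medoid)
```

Here the centroid lands at 26.5 while the medoid stays at 2.0, illustrating the robustness trade-off between the two methods.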

In [2648]:
# Reusing the K-Medoids labels predicted on data_pca above and adding them to the whole data
data['K_medoids_segments_5'] = preds

# Adding K-Medoids cluster labels to data_model
data_model['K_medoids_segments_5'] = preds

# Adding K-Medoids cluster labels to data_pca
data_pca['K_medoids_segments_5'] = preds
In [2649]:
# Let's check the distribution
data_model["K_medoids_segments_5"].value_counts()
Out[2649]:
1    630
2    580
0    477
3    278
4    262
Name: K_medoids_segments_5, dtype: int64

Let's visualize the clusters using PCA

In [2650]:
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "K_medoids_segments_5")

Cluster Profiling¶

In [2651]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_KMedoids_5 = data.groupby("K_medoids_segments_5").mean()
In [2652]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMedoids_5.style.highlight_max(color="lightgreen", axis=0)
Out[2652]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
K_medoids_segments_5                                                                
0 1968.664570 76368.226415 0.044025 0.220126 50.884696 658.668763 75.383648 481.109015 109.316562 79.784067 82.222222 1.385744 5.488470 5.920335 8.576520 3.075472 0.090147 0.132075 0.285115 0.232704 0.029350 0.004193 0.322851 47.335430 0.264151 1.610063 1.874214 1404.262055 21.371069 566.538784 1.092243 71.858839
1 1972.455556 32820.985714 0.800000 0.409524 39.488889 23.952381 3.442857 13.973016 5.104762 3.395238 10.428571 1.560317 1.644444 0.312698 2.885714 6.553968 0.085714 0.004762 0.000000 0.000000 0.003175 0.011111 0.084127 43.544444 1.209524 1.647619 2.857143 49.868254 6.403175 475.193651 0.177778 7.093609
2 1965.594828 61851.193103 0.172414 0.734483 46.206897 499.455172 27.700000 180.781034 37.748276 27.312069 63.991379 3.101724 6.156897 3.893103 8.237931 4.874138 0.068966 0.132759 0.043103 0.044828 0.020690 0.010345 0.112069 50.405172 0.906897 1.660345 2.567241 772.996552 21.389655 565.806897 0.422414 37.009533
3 1969.600719 38431.715827 0.784173 0.654676 45.960432 134.028777 8.032374 55.309353 13.003597 7.888489 37.133094 4.276978 4.262590 1.194245 4.197842 7.733813 0.068345 0.046763 0.000000 0.017986 0.003597 0.014388 0.223022 46.399281 1.438849 1.597122 3.035971 218.262590 13.931655 710.190647 0.359712 15.142116
4 1967.255725 43463.137405 0.561069 0.610687 78.835878 90.087786 8.790076 37.767176 11.583969 8.828244 19.458015 2.057252 2.782443 1.015267 4.206107 4.919847 0.026718 0.041985 0.003817 0.007634 0.000000 0.003817 0.000000 48.744275 1.171756 1.717557 2.889313 157.057252 10.061069 394.618321 0.080153 13.714535

Let's plot the boxplot

In [2653]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['K_medoids_segments_5'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

Cluster 0: High Income no Kids/Teens

Summary for cluster 0: This group is made up of 477 customers with high recency, the highest spend in every food category, and the lowest use of deals. They have the fewest website visits and the highest store and catalog purchases. They have the highest expenses, amount per purchase, and number of accepted campaigns, and are essentially tied for the highest number of total purchases.

Cluster 1: Low Income with Kids and Youngest

Summary for cluster 1: This group is made up of 630 customers with the lowest spend in every category but also the lowest recency. They are second highest for website visits per month, and second lowest both for accepting campaigns and for how long they have been customers. They have the lowest expenses and number of purchases.

Cluster 2: High/Middle Income with Kids/Teens and Oldest

Summary for cluster 2: This group is made up of 580 customers with higher income and kids at home. They are second highest for spend in every food category and for deal usage. They are highest for web purchases and second highest for store and catalog purchases. They are second lowest for web visits per month and were highest for accepting campaign 4. They are the oldest group, with the highest number of total purchases and the second-highest amount per purchase.

Cluster 3: Low/Middle Income with Kids/Teens Second Youngest

Summary for cluster 3: This group is made up of 278 customers with low income and kids at home. They are the middle group for food purchases and the heaviest users of deals. They are the middle group for web and catalog purchases, second lowest for store purchases, and highest for web visits per month. They are highest for complaints and second highest for response. They have the largest family size and are the middle group for expenses, number of total purchases, and amount per purchase. They have been customers the longest.

Cluster 4: Middle Income with Kids/Teens Second Oldest

Summary for cluster 4: This group is made up of 262 customers with the highest recency. They are either second lowest or squarely in the middle for every food purchase category, and the middle group for deals, catalog, and store purchases and web visits. They are second lowest for web purchases. The campaign they responded to best was campaign 4, but they have the lowest response overall. They are the second-oldest group, with the highest status and the second-largest family size. They have the second-lowest expenses, total purchases, and amount per purchase, are lowest for total accepted campaigns, and have been customers for the least amount of time.

In [2654]:
# Dropping labels we got from K-Medoids since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("K_medoids_segments_5", axis=1, inplace=True)
data.drop("K_medoids_segments_5", axis=1, inplace=True)

Hierarchical Clustering¶

Let's find the Cophenetic correlation for different distances with different linkage methods.

What is a Cophenetic correlation?¶

The cophenetic correlation coefficient is the correlation between the cophenetic distances (dendrogram distances) obtained from the tree and the original pairwise distances used to construct the tree. It measures how faithfully a dendrogram preserves the pairwise distances between the original, unmodeled data points.

The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.

Cophenetic correlation is also a way to compare two or more dendrograms.
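The definition above can be checked by hand on a toy example. For three 1-D points at 0, 1, and 5 under single linkage, the pair {0, 1} joins at height 1 and {5} joins the rest at height 4, so the cophenetic correlation is the Pearson correlation between those join heights and the original pairwise distances (a pure-Python sketch on illustrative data):

```python
from math import sqrt

# Three 1-D points at 0, 1, 5 under single linkage:
# {0, 1} merge at height 1; {5} joins at height 4, so the cophenetic
# distance from 5 to each of the other points is 4.
original_d = [1.0, 5.0, 4.0]   # d(0,1), d(0,5), d(1,5)
cophenet_d = [1.0, 4.0, 4.0]   # heights at which each pair is first joined

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den

print(round(pearson(original_d, cophenet_d), 3))
```

A value near 1, as here, means the dendrogram's merge heights track the original distances closely, which is exactly what the loop below searches for across metrics and linkages.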

Let's calculate Cophenetic correlation for each of the distance metrics with each of the linkage methods

In [2655]:
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]

# list of linkage methods
linkage_methods = ["single", "complete", "average"]

high_cophenet_corr = 0                                                 # Tracks the highest cophenetic correlation seen so far
high_dm_lm = [0, 0]                                                    # Stores the corresponding distance metric and linkage method

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca, metric=dm, method=lm)                    # Applying different linkages with different distance on data_pca
        c, coph_dists = cophenet(Z, pdist(data_pca))                   # Calculating cophenetic correlation
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:                                     # Checking if this cophenetic correlation beats the previous best
            high_cophenet_corr = c                                     # Updating the best score
            high_dm_lm[0] = dm                                         # Recording its distance metric
            high_dm_lm[1] = lm                                         # Recording its linkage method
Cophenetic correlation for Euclidean distance and single linkage is 0.8070498805503135.
Cophenetic correlation for Euclidean distance and complete linkage is 0.5915211029724521.
Cophenetic correlation for Euclidean distance and average linkage is 0.8563332929010883.
Cophenetic correlation for Chebyshev distance and single linkage is 0.7923751605955872.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.68529576496396.
Cophenetic correlation for Chebyshev distance and average linkage is 0.8041971696973682.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.6226039066330016.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.38496663858473046.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.6546345527169063.
Cophenetic correlation for Cityblock distance and single linkage is 0.8051878506561675.
Cophenetic correlation for Cityblock distance and complete linkage is 0.816242660463673.
Cophenetic correlation for Cityblock distance and average linkage is 0.8566433845473891.
In [2656]:
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.8566433845473891, which is obtained with Cityblock distance and average linkage.
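For reference, the fixed distance metrics compared above differ as follows on a single pair of points (a pure-Python sketch; Mahalanobis is omitted because it also requires the data's inverse covariance matrix):

```python
# One illustrative pair of 2-D points
p, q = (0.0, 0.0), (3.0, 4.0)

euclidean = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5  # straight-line distance
chebyshev = max(abs(a - b) for a, b in zip(p, q))           # largest single-coordinate gap
cityblock = sum(abs(a - b) for a, b in zip(p, q))           # Manhattan / L1 distance

print(euclidean, chebyshev, cityblock)
```

For this pair the three metrics give 5.0, 4.0, and 7.0 respectively: Cityblock always upper-bounds Euclidean, while Chebyshev lower-bounds it, which is why the same linkage can rank cluster merges differently under each metric.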
In [2657]:
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="cityblock", linkage="average")     # Initializing Agglomerative Clustering with 3 clusters, Cityblock distance, and average linkage

# Fit on data_pca
HCmodel.fit(data_pca) 

labels = HCmodel.labels_           

score = silhouette_score(data_pca, labels)           # Calculate the silhouette score

print(score) 
0.6409888705869505
In [2658]:
# Add Agglomerative Clustering cluster labels to data_pca

data_pca["HClabels"]= HCmodel.labels_

# Add Agglomerative Clustering cluster labels to the whole data

data["HClabels"]= HCmodel.labels_

# Add Agglomerative Clustering cluster labels to data_model

data_model["HClabels"]= HCmodel.labels_
In [2659]:
data_model["HClabels"].value_counts()
Out[2659]:
0    2225
2       1
1       1
Name: HClabels, dtype: int64

Let's have a look at the dendrograms for different linkages with Cityblock distance

In [2660]:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="cityblock", method=method)                  # Computing the linkage matrix (hierarchical tree) with Cityblock distance

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding cophenetic correlation for different linkages with city block distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observations and Insights:

We can make out 2 good links in single linkage. Complete linkage looks promising in that distinct links are visible, but because we care most about the top links having the greatest height, average linkage looks best: the distances between the subclusters merged near the top of the tree remain large.

Cophenetic correlations: single 0.81, complete 0.82, average 0.86.
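The three linkage methods differ only in how they turn the pairwise point distances between two clusters into one cluster-to-cluster distance, which is what shapes the dendrograms above. A pure-Python sketch on two illustrative 1-D clusters:

```python
# Two illustrative 1-D clusters
A, B = [0.0, 1.0], [5.0, 9.0]

pair_d = [abs(a - b) for a in A for b in B]   # all cross-cluster pair distances: [5, 9, 4, 8]
single = min(pair_d)                          # single linkage: nearest pair
complete = max(pair_d)                        # complete linkage: farthest pair
average = sum(pair_d) / len(pair_d)           # average linkage: mean of all pairs

print(single, complete, average)
```

Single linkage (4.0 here) tends to chain clusters together through close pairs, complete (9.0) favors compact clusters, and average (6.5) sits between the two, which is consistent with average linkage scoring the best cophenetic correlation above.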

Think about it:

  • Can we clearly decide the number of clusters based on where to cut the dendrogram horizontally?
  • What is the next step in obtaining number of clusters based on the dendrogram?

Let's have a look at the dendrograms for different linkages with Chebyshev distance

In [2661]:
# Plot the dendrogram for Chebyshev distance with linkages single, complete and average. 
# Hint: Use Chebyshev distance as the metric in the linkage() function 

# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="chebyshev", method=method)                  # Computing the linkage matrix (hierarchical tree) with Chebyshev distance

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding the cophenetic correlation for this linkage
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observations and Insights: Single linkage gives 2 clear links before losing height. Complete linkage shows the most clearly defined links, but again height drops drastically after 2. Average linkage looks good for the first two links but quickly loses height after that.

Let's have a look at the dendrograms for different linkages with Mahalanobis distance

In [2662]:
# Plot the dendrogram for Mahalanobis distance with linkages single, complete and average. 
# Hint: Use Mahalanobis distance as the metric in the linkage() function 
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="mahalanobis", method=method)                # Computing the linkage matrix (hierarchical tree) with Mahalanobis distance

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding the cophenetic correlation for this linkage
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observations and Insights:

With Mahalanobis distance, average linkage keeps good height for its top links.

Cophenetic correlations: single 0.62, complete 0.38, average 0.65.

Let's have a look at the dendrograms for different linkages with Euclidean distance

In [2663]:
# Plot the dendrogram for Euclidean distance with linkages single, complete, average and ward. 
# Hint: Use Euclidean distance as the metric in the linkage() function 
# List of linkage methods
linkage_methods = ["single", "complete", "average"]

# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))            # Setting the plot size

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="euclidean", method=method)                  # Computing the linkage matrix (hierarchical tree) with Euclidean distance

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")           # Title of dendrogram

    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))                       # Finding the cophenetic correlation for this linkage
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Think about it:

  • Are there any distinct clusters in any of the dendrograms?

Observations and Insights:

Single linkage has a cophenetic correlation of 0.81.

Complete linkage has a cophenetic correlation of 0.59.

With Euclidean distance, average linkage gives good height for the first 3 clusters, with a cophenetic correlation of 0.86.
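The compare_cols list defined in the cell above is never actually populated. As a hedged sketch (on synthetic data standing in for data_pca, with hypothetical variable names), the cophenetic coefficients could be collected into a small comparison table like this:

```python
import pandas as pd
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

# Synthetic stand-in for data_pca
X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

rows = []
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method, metric="euclidean")        # Linkage matrix for this method
    coph_corr, _ = cophenet(Z, pdist(X))                     # Cophenetic correlation vs. raw pairwise distances
    rows.append({"Linkage": method, "Cophenetic Coefficient": round(coph_corr, 2)})

comparison = pd.DataFrame(rows, columns=["Linkage", "Cophenetic Coefficient"])
print(comparison)
```

A table like this makes it easy to justify the choice of linkage in one glance rather than reading each dendrogram annotation.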

In [2664]:
# Initialize Agglomerative Clustering with affinity (distance) as Euclidean, linkage as 'Ward' with clusters=3
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="ward")

# Fit on data_pca
HCmodel.fit(data_pca) 

labels = HCmodel.labels_           

score = silhouette_score(data_pca, labels)           # Calculate the silhouette score

print(score) 
0.2543856181377133
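One way to sanity-check the Ward choice is to compare silhouette scores across the linkage options for the same cluster count. A minimal sketch on synthetic data (the notebook's data_pca is not reused here, so the numbers will differ):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for data_pca
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

scores = {}
for link in ["ward", "complete", "average", "single"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    scores[link] = silhouette_score(X, labels)               # Higher is better (max 1)

for link, s in scores.items():
    print(f"{link}: {s:.3f}")
```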
In [2665]:
# Add Agglomerative Clustering cluster labels to data_pca

data_pca["HClabels"]= HCmodel.labels_

# Add Agglomerative Clustering cluster labels to the whole data

data["HClabels"]= HCmodel.labels_

# Add Agglomerative Clustering cluster labels to data_model

data_model["HClabels"]= HCmodel.labels_
In [2666]:
# Let's check the distribution
data_model["HClabels"].value_counts()
Out[2666]:
1    1010
2     675
0     542
Name: HClabels, dtype: int64

Let's visualize the clusters using PCA.

In [2667]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)

PCA_PLOT(0, 1, data_pca, "HClabels")

Cluster Profiling¶

In [2668]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_HClabels = data.groupby("HClabels").mean()
In [2669]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_HClabels.style.highlight_max(color="lightgreen", axis=0)
Out[2669]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
HClabels                                                                
0 1968.732472 75132.143911 0.055351 0.241697 48.254613 607.841328 72.605166 449.559041 104.774908 73.365314 73.188192 1.418819 5.370849 5.618081 8.612546 2.998155 0.073801 0.125461 0.249077 0.195572 0.033210 0.005535 0.278598 47.267528 0.297048 1.608856 1.905904 1308.145756 21.020295 543.605166 0.955720 67.627434
1 1971.227723 34931.661386 0.768317 0.470297 49.204950 41.816832 4.583168 21.501980 6.456436 4.725743 15.413861 1.923762 2.087129 0.536634 3.202970 6.377228 0.072277 0.012871 0.000000 0.000000 0.001980 0.010891 0.087129 44.772277 1.238614 1.646535 2.885149 79.084158 7.750495 502.357426 0.174257 8.990624
2 1965.514074 57847.875556 0.272593 0.777778 49.675556 456.499259 21.848889 152.648889 30.511111 23.712593 63.837037 3.642963 6.103704 3.392593 7.485926 5.623704 0.074074 0.127407 0.040000 0.056296 0.013333 0.008889 0.140741 50.485926 1.050370 1.671111 2.721481 685.220741 20.625185 587.536296 0.451852 33.006349

Let's plot the boxplot

In [2670]:
# Create boxplot for each of the variables
all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['HClabels'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

Cluster 0: High Income no Kids

Summary for cluster 0: This cluster is made up of 542 customers with the highest income and no kids or teens at home. They have the lowest recency and the highest purchases across all the food categories, use the fewest deals, are the middle group for web purchases, and are the highest group for catalog and store purchases. They have the fewest web visits per month and the best campaign acceptance overall, leading campaigns 1, 2, and 5, with 5 being the one they responded to best. They have the highest expenses, total number of purchases, and amount per purchase, and are the middle group for age.

Cluster 1: Low Income with Kids

Summary for cluster 1: This group is made up of 1010 of the youngest customers, with low income and the most kids at home. They have the lowest purchases across all the food categories by a considerable margin and are the lowest for web, catalog, and store purchases. They are the middle group for deals purchases and the highest for web visits per month. Of all the campaigns they responded best to campaign 3, and they have the highest number of complaints. They have the largest family size, the lowest expenses, total purchases, total accepted campaigns, and amount per purchase, and have been customers for the shortest amount of time.

Cluster 2: Middle Income with Kids

Summary for cluster 2: This group is made up of 675 customers with middle income and kids in the home. They have the highest recency, are the middle group for all food products, and are the highest for deals and web purchases. They are the middle group for catalog and store purchases and web visits per month, and the highest group for accepting campaigns 3 and 4, with 4 being the one they responded to best. They are the oldest group and the middle group for expenses, number of total purchases, total accepted campaigns, and amount per purchase. They have been customers the longest of all the groups.

Observations and Insights:

Ward linkage is known for producing clusters of more equal size, but here we still have one cluster with nearly twice as many members as another.

In [2671]:
# Dropping labels we got from Agglomerative Clustering since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("HClabels", axis=1, inplace=True)
data.drop("HClabels", axis=1, inplace=True)

DBSCAN¶

DBSCAN is a very powerful algorithm for finding high-density clusters, but the challenge is determining the best set of hyperparameters to use with it. It has two hyperparameters: eps and min_samples.

Since it is an unsupervised algorithm, we cannot validate it the way we would a supervised model on a held-out set. The approach we can follow is to try a number of different combinations of values and compute the silhouette score for each.
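One generic way to run such a search is with itertools.product over the candidate values. This is a hedged, self-contained sketch on synthetic blobs (not the notebook's data_pca), skipping degenerate fits where DBSCAN returns a single label:

```python
import itertools

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs stand in for the real data
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=0.8, random_state=1)

best = {"score": -1.0, "eps": None, "min_samples": None}
for eps, min_samples in itertools.product([0.5, 1.0, 2.0], [5, 10, 20]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    if len(set(labels)) < 2:                                 # silhouette needs at least 2 labels
        continue
    score = silhouette_score(X, labels)
    if score > best["score"]:
        best = {"score": score, "eps": eps, "min_samples": min_samples}

print(best)
```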

In [2672]:
# Initializing lists
eps_value = [2, 3]                       # Candidate eps values
min_sample_values = [6, 20]              # Candidate min_samples values

# Creating a dictionary for each of the values in eps_value with min_sample_values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}  
In [2673]:
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0                                               # Track the best silhouette score seen so far
high_i_j = [0, 0]                                                     # Track the (eps, min_samples) pair that produced it
key = res.keys()                                                      # Dictionary keys (eps values)
for i in key:
    z = res[i]                                                        # min_samples candidates for this eps
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)               # Applying DBSCAN to each combination in the dictionary
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)           # Finding silhouette score
        print(
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,                                           # Printing the silhouette score for each combination
        )
        if high_silhouette_avg < silhouette_avg:                      # Keep the best score and its (eps, min_samples) pair
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhouette_score is : 0.1998438349972484
For eps value =2 For min sample =20 The average silhouette_score is : 0.33805442426361443
For eps value =3 For min sample =6 The average silhouette_score is : 0.3330837442382384
For eps value =3 For min sample =20 The average silhouette_score is : 0.3390186745863706
In [2674]:
# Printing the highest silhouette score
print("Highest silhouette_avg is {} for eps = {} and min sample = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest silhouette_avg is 0.3390186745863706 for eps = 3 and min sample = 20

Now, let's apply DBSCAN using the hyperparameter values we have received above.

In [2675]:
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps = 3, min_samples = 20)
In [2676]:
# fit_predict on data_pca and add DBSCAN cluster labels to the whole data

data["DBSCAN"]= dbs.fit_predict(data_pca)

# fit_predict on data_pca and add DBSCAN cluster labels to data_model

data_model["DBSCAN"]= dbs.fit_predict(data_pca)

# fit_predict on data_pca and add DBSCAN cluster labels to data_pca

data_pca["DBSCAN"]= dbs.fit_predict(data_pca)
In [2677]:
# Let's check the distribution
data_model["DBSCAN"].value_counts()
Out[2677]:
 0    1902
-1     325
Name: DBSCAN, dtype: int64

Let's visualize the clusters using PCA.

In [2678]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)

PCA_PLOT(0, 1, data_pca, "DBSCAN")

Observations and Insights:

While our silhouette score is relatively high, the clusters are not what we are looking for. Two clusters will not give insight into our customers, and these two clusters in particular are very unevenly distributed.

DBSCAN was able to give only two clusters with eps = 3 and min_samples = 20, and the split is very skewed. It is not able to perform well on this dataset. This is due to the overlapping nature of the data, which causes multiple clusters to get grouped together into one large cluster.

Think about it:

  • Will changing the eps and min_samples values result in different DBSCAN results? Can we try more values for eps and min_samples?

Note: You can experiment with different eps and min_sample values to see if DBSCAN produces good distribution and cluster profiles.
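Rather than guessing eps values, a common heuristic is the k-distance curve: sort each point's distance to its k-th nearest neighbor and look for the "elbow". A hedged sketch on synthetic data (a real run would use data_pca and plot the sorted curve):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for data_pca
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

min_samples = 20
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)                              # Distances to each point's 20 nearest neighbours
k_distances = np.sort(distances[:, -1])                      # k-th neighbour distance, sorted ascending

# Without a plot, a few quantiles of the sorted curve give candidate eps values
for q in (0.50, 0.90, 0.95):
    print(f"{int(q * 100)}th percentile k-distance: {np.quantile(k_distances, q):.2f}")
```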

In [2679]:
# Dropping labels we got from DBSCAN since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("DBSCAN", axis=1, inplace=True)
data.drop("DBSCAN", axis=1, inplace=True)

Gaussian Mixture Model¶

Let's find the silhouette score for K=5 in Gaussian Mixture

In [2680]:
gmm = GaussianMixture(n_components = 5, random_state = 1) # Initialize Gaussian Mixture Model with number of clusters as 5 and random_state=1

preds = gmm.fit_predict(data_pca)                       # Fit and predict Gaussian Mixture Model using data_pca

score = silhouette_score(data_pca, preds)               # Calculate the silhouette score

print(score)                                            # Print the score
0.1423192906893203

Observations and Insights:

The silhouette score can be improved by lowering the number of components, but we would lose some of the more interesting distinctions that we found by having 5 clusters. For 5 clusters, K-means gives a better silhouette score.
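For a mixture model, the number of components can also be chosen by an information criterion such as BIC rather than silhouette alone. An illustrative sketch on synthetic data (not the notebook's data_pca):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Four synthetic blobs stand in for the real data
X, _ = make_blobs(n_samples=400, centers=4, random_state=1)

bics = {}
for k in range(2, 8):
    gm = GaussianMixture(n_components=k, random_state=1).fit(X)
    bics[k] = gm.bic(X)                                      # Lower BIC is better

best_k = min(bics, key=bics.get)
print("Best number of components by BIC:", best_k)
```

BIC penalizes model complexity, so it can favor a different k than the silhouette score does; comparing both is a useful cross-check.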

In [2681]:
# Add Gaussian Mixture Model cluster labels (already predicted on data_pca above) to the whole data
data['Gaussian_Mixture'] = preds

# Add Gaussian Mixture Model cluster labels to data_model
data_model['Gaussian_Mixture'] = preds

# Add Gaussian Mixture Model cluster labels to data_pca
data_pca['Gaussian_Mixture'] = preds
In [2682]:
# Let's check the distribution
data_model["Gaussian_Mixture"].value_counts()
Out[2682]:
1    737
4    630
0    431
3    369
2     60
Name: Gaussian_Mixture, dtype: int64

Let's visualize the clusters using PCA.

In [2683]:
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)

PCA_PLOT(0, 1, data_pca, "Gaussian_Mixture")

Cluster Profiling¶

In [2684]:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_Gaussian_Mixture = data.groupby("Gaussian_Mixture").mean()
In [2685]:
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_Gaussian_Mixture.style.highlight_max(color="lightgreen", axis=0)
Out[2685]:
  Year_Birth Income Kidhome Teenhome Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Response Age Kids Status Family_Size Expenses NumTotalPurchases Engaged_in_days TotalAcceptedCmp AmountPerPurchase
Gaussian_Mixture                                                                
0 1968.116009 76767.313225 0.013921 0.132251 51.220418 620.044084 66.976798 487.909513 101.807425 68.612529 74.779582 1.048724 4.955916 5.844548 8.034803 2.301624 0.081206 0.106729 0.264501 0.197216 0.013921 0.004640 0.255220 47.883991 0.146172 1.624130 1.770302 1345.350348 19.883991 515.719258 0.918794 68.336971
1 1972.195387 31711.922659 0.854817 0.435550 48.537313 18.434193 3.134328 12.312076 4.666214 3.297151 10.364993 1.758480 1.610583 0.303935 2.857531 6.428765 0.075984 0.004071 0.000000 0.000000 0.002714 0.013569 0.080054 43.804613 1.290366 1.644505 2.934871 41.843962 6.530529 485.981004 0.162822 6.243942
2 1968.600000 67479.133333 0.233333 0.366667 51.316667 769.400000 50.250000 397.300000 57.983333 45.466667 68.900000 1.783333 3.716667 4.466667 5.133333 5.750000 0.100000 0.400000 0.483333 0.333333 0.133333 0.000000 0.516667 47.400000 0.600000 1.616667 2.216667 1320.400000 15.100000 640.566667 1.966667 112.367329
3 1966.287263 63578.964770 0.157182 0.720867 45.607046 568.227642 47.121951 216.214092 62.536585 50.821138 79.525745 3.268293 6.769648 4.783198 9.411924 5.403794 0.089431 0.121951 0.048780 0.081301 0.029810 0.013550 0.176152 49.712737 0.878049 1.628726 2.506775 944.921409 24.233062 629.457995 0.547425 39.293014
4 1967.100000 49336.715873 0.447619 0.738095 50.200000 227.233333 11.347619 72.114286 15.922222 11.206349 39.647619 3.349206 4.912698 1.747619 5.723810 6.020635 0.052381 0.077778 0.001587 0.014286 0.003175 0.004762 0.109524 48.900000 1.185714 1.671429 2.857143 337.823810 15.733333 551.515873 0.258730 20.186831

Let's plot the boxplot

In [2686]:
# Create boxplot for each of the variables

all_col = col_for_box

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    
    sns.boxplot(y=data[variable], x=data['Gaussian_Mixture'],showmeans=True)
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Characteristics of each cluster¶

Cluster 0: High Income no Kids, Middle Age

Summary for cluster 0: This group is made up of 431 customers with the highest income and no kids or teens. They have the highest purchases for all food categories other than gold products. They have the lowest number of deals purchases, the highest catalog purchases, and the second highest web and store purchases. They have the lowest number of web visits per month and are second highest for total campaigns accepted, with 5 being the one they accepted most. They have the highest expenses and are second highest for number of total purchases and amount per purchase. They are second lowest for engaged-in days.

Cluster 1: Low Income with Kids, Youngest

Summary for cluster 1: This group is made up of 737 of the youngest customers, with low income and kids and teens in the house. They have the lowest purchases for all the food categories and the lowest web, store, and catalog purchases. They are second lowest for deals purchases and highest for web visits per month. They are lowest for accepting campaigns, with campaign 3 being the one they accepted best. They have the highest complaints, the highest number of kids, the largest family size, and are lowest for expenses, total purchases, amount per purchase, and engaged-in days.

Cluster 2: High/Middle Income with Teens

Summary for cluster 2: This group is made up of 60 customers with high income and teens in the home. They have the highest recency and the highest wine purchases, coming in second for fruit and meat products. They are second lowest for web and store purchases. This group falls in the middle for many other categories but is highest for accepting all the campaigns and for response. They have the second highest expenses but are second lowest for total number of purchases. They are highest for amount per purchase and engaged-in days.

Cluster 3: Middle Income with Teens, Oldest

Summary for cluster 3: This group is made up of 369 of the oldest customers, with middle income and teens in the home. They are the middle group for wine, fruit, and meat products and second highest for fish and sweet products. They are highest for gold products and second highest for number of deals purchases and catalog purchases. They are highest for number of web purchases and store purchases, and second lowest for number of web visits per month. They are the middle group for total accepted campaigns, with 4 being the one they accepted most. They are the middle group for expenses and amount per purchase but the top group for number of total purchases. They are second highest for engaged-in days.

Cluster 4: Lower/Middle Income with Kids/Teens, Second Oldest

Summary for cluster 4: This group is made up of 630 customers in the lower-middle income class with the highest number of teens in the home. They are second lowest for all the food categories and the heaviest users of deals purchases. They are the middle group for web and store purchases. They are second lowest for catalog purchases and second highest for web visits per month. They are the second lowest group for total accepted campaigns, with 4 being the one they accepted most. They are the second oldest group, with the highest status and the second highest family size. They are second lowest for expenses and amount per purchase. They are the middle group for number of total purchases and second lowest for engaged-in days.

In [2687]:
data_pca.drop("Gaussian_Mixture", axis=1, inplace=True)
data.drop("Gaussian_Mixture", axis=1, inplace=True)

Conclusion and Recommendations¶

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

In building our models we started with the simplest model, K-means. Figure 3 shows the elbow plot we used to determine the ideal number of clusters; it suggested that 3 or 5 would be good values for k.

We executed different clustering algorithms with k equal to both 3 and 5 and found that k=5 gave us more insight into our customers' purchasing behavior and marketing preferences. We then calculated the silhouette score for all of our clustering methods, beginning with K-means, where the silhouette score for 3 clusters was 0.27 and for 5 clusters was 0.24. Figure 4 below shows the clusters for K-means with k=3, which gave us our best silhouette score. While this method was a good, computationally inexpensive option, it only showed us basic clusters that lacked complexity.

Second, we performed K-Medoids clustering to see if our data responded better. This method is good at reducing the influence of noise and outliers, and it gave us a silhouette score of 0.11. The clusters were similar to those of K-means but with a lower silhouette score.

Hierarchical clustering led us to compute the cophenetic correlation for different distance metrics and linkage methods. Our highest cophenetic correlation is 0.86, obtained with Euclidean distance and average linkage.

We performed our hierarchical clustering with Euclidean distance and Ward linkage, which gave us a silhouette score of 0.25. While this silhouette score was relatively high, it only gave us three clusters, because our dendrograms did not indicate that 5 was advisable.

When we performed DBSCAN we found our best silhouette score at eps = 3 and min_samples = 20. The silhouette score for this method was 0.34, our best so far; however, it divided our data into just 2 clusters, which would not give us enough information about our customers, and this was also the reason the silhouette score was so high. The issue with DBSCAN here is that our clusters overlap, so multiple clusters got grouped together into one large cluster. We changed the hyperparameters to see if we could get better results, without success.

Lastly, we used the Gaussian Mixture Model with k=5 and got a silhouette score of 0.14. With k equal to 5, this gave us the best silhouette score after K-means, outperforming K-Medoids.

Of the clustering methods that divided our data into 5 clusters, K-means performed the best. We discovered early on that 5 clusters would give us the most insight into our customers' buying habits, and this remained true throughout the analysis.
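The side-by-side comparison described above can be reproduced in miniature. This is an illustrative sketch on synthetic blobs (the scores it prints will not match the notebook's, since the real data_pca is not used):

```python
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Five synthetic blobs stand in for the notebook's PCA-reduced data
X, _ = make_blobs(n_samples=500, centers=5, random_state=1)

results = {
    "KMeans (k=5)": KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X),
    "Agglomerative (k=3)": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "GMM (k=5)": GaussianMixture(n_components=5, random_state=1).fit_predict(X),
    "DBSCAN": DBSCAN(eps=1.5, min_samples=10).fit_predict(X),
}

for name, labels in results.items():
    if len(set(labels)) > 1:
        print(f"{name}: silhouette = {silhouette_score(X, labels):.2f}")
    else:
        print(f"{name}: single cluster, silhouette undefined")
```

Running all candidates against the same data in one cell like this keeps the metric comparison honest and easy to rerun as the dataset grows.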

2. Refined insights:¶

  • What are the most meaningful insights from the data relevant to the problem?

By implementing unsupervised learning methods such as dimensionality reduction and clustering, we were better able to identify patterns in our client's dataset of customers. To better understand the customers and increase ROI we used customer segmentation, which groups customers into like clusters. This helped identify each customer's preferred method of purchasing and maximized the client's marketing potential.

To achieve this goal we performed data analysis and dimensionality reduction that gave us insight into the customers' defining attributes. When we began exploring clustering methods, each algorithm's success was judged by the rank of its silhouette score. We chose the method with the highest silhouette score to ensure that our clusters would be grouped based on their cohesion compared to other clusters. Our most successful clustering method, one that gave insight into purchasing patterns while also providing diverse customer profiles, was K-means with k equal to 5, which gave us a silhouette score of 0.24. This method gave us five distinct clusters based on income being high, middle, or low, separated further by age and family size. By implementing this clustering method we were able to identify patterns in the customer dataset that allow for more effective marketing campaigns, thus increasing productivity and sales.

3. Proposal for the final solution design:¶

  • What model do you propose to be adopted? Why is this the best solution to adopt?

The model that will produce the best results for the business and the marketing team specifically is K-Means with k=5. These clusters are clearly defined according to income, family size, age, purchasing habits, and campaign interactions. We can also make some valuable suggestions based on the trends in the clusters as to how to treat each group going forward.

While some of the other methods would be interesting to explore, cloud services can get very expensive with large datasets. Our dataset is not particularly large right now, but it will continue to grow as we add new data and gain new customers.

We can confidently proceed with implementing our model, knowing that the clusters created give us the best possible outcome based on our carefully chosen segmentation attributes and the method used to produce them. We can draw viable conclusions about each cluster's buying habits and how receptive each group will be to various types of advertising.

In [ ]: